Preface



AI2000 was the 13th in the series of biennial Artificial Intelligence conferences
sponsored by the Canadian Society for Computational Studies of Intelligence/Société
canadienne pour l’étude de l’intelligence par ordinateur. For the
first time, the conference was held in conjunction with four other conferences. 
Two of these were the annual conferences of two other Canadian societies, Graph- 
ics Interface (GI 2000) and Vision Interface (VI 2000), with which this conference 
has been associated in recent years. The other two conferences were the Inter- 
national Symposium on Robotics (ISR 2000) and the Institute for Robotics and 
Intelligent Systems conference (IRIS 2000). It is our hope that the overall experience
will be enriched by this conjunction of conferences.

The Canadian AI conference has a 25-year tradition of attracting Canadian
and international papers of high quality from a variety of AI research areas. 
All papers submitted to the conference received three independent reviews. Ap- 
proximately one third were accepted for plenary presentation at the conference. 
A journal version of the best paper of the conference will be invited to appear 
in Computational Intelligence. The conference attracted submissions from six 
continents, and this diversity is represented in these proceedings. The overall 
framework is similar to that of the last conference, AI’98. Plenary presentations 
were given of 25 papers, organized into sessions based on topics. Poster presen- 
tations were given for an additional 13 papers. A highlight of the conference 
continues to be the invited speakers. Three speakers, Eugene Charniak, Eric 
Horvitz, and Jan Zytkow, were our guests this year. 

Many people contributed to the success of this conference. The members 
of the program committee coordinated the refereeing of all submitted papers. 
They also made several recommendations that contributed to other aspects of 
the program. The referees provided reviews of the submitted technical papers; 
their efforts were irreplaceable in ensuring the quality of the accepted papers. 
Our thanks also go to those who organized various conference events and helped 
with other conference matters, especially Heather Caldwell, Helene Lamadeleine, 
and Bob Mercer. We also acknowledge the help we received from Alfred Hofmann 
and others at Springer-Verlag. 

Lastly, we are pleased to thank all participants. You are the ones who make 
all this effort worthwhile! 
Abstract. The seminal works of Nilsson and Pearl in the 1970s and
early 1980s provide a formal basis for splitting the field of heuristic
search into two subfields: single- and two-agent search. The subfields are 
studied in relative isolation from each other, each having its own distinct
character. Despite the separation, a close inspection of the research
shows that the two areas have actually been converging. This paper ar- 
gues that the single/two-agent distinction is not the essence of heuristic 
search anymore. The state space is characterized by a number of key 
properties that are defined by the application; single- versus two-agent 
is just one of many. Both subfields have developed many search enhance- 
ments; they are shown to be surprisingly similar and general. Given their 
importance for creating high performance search applications, it is these 
enhancements that form the essence of our field. Focusing on their gen- 
erality emphasizes the opportunity for reuse of the enhancements, allows 
the field of heuristic search to be redefined as a single unified field, and 
points the way towards a modern theory of search based on the taxonomy 
proposed here. 



1 Introduction 

Heuristic search is one of the oldest fields in artificial intelligence. Nilsson and 
Pearl [20, 21] wrote the classic introductions to the field. In these works (and
others) search algorithms are typically classified by the kind of problem space 
they explore. Two classes of problem spaces are identified: state spaces and 
problem reduction spaces. Many problems can be conveniently represented as a 
state space; these are typically problems that seek a path from the root to the 
goal state. Other problems are a more natural fit for problem reduction spaces, 
typically problems whose solution is a strategy. Sometimes both representations 
are viable options. Problem reduction spaces are AND/OR graphs; AO* is the
best-known framework for creating search algorithms for this class of problems
[20]. State spaces are OR graphs; the A* algorithm can optimally solve this
class of problems. Note that a state space (OR graph) is technically just a
special case of a problem reduction space (AND/OR graph).

Since their inception, the notions of OR graphs and AND/OR graphs have 
found widespread use in artificial intelligence and operations research. Both areas 
have active research communities which continue to evolve and refine new search 
algorithms and enhancements. Of the two representations, the state space rep- 
resentation has proven to be the more popular. It appears that many real-world 
problem solving tasks can be modeled naturally as OR graphs. Well-known ex- 
amples include the shortest path problems, sliding-tile puzzles, and NP-complete 
problems. 

One application domain that fits the AND/OR graph model better is two- 
agent (two-player) games such as chess. In these games, one player chooses moves 
to maximize a payoff function (the chance to win) while the opponent chooses 
moves to minimize it. Thus, the AND/OR graphs become MIN/MAX graphs, 
and the algorithms to search these spaces are known as minimax algorithms. 
Curiously, it appears that two-player games are the only applications for which 
AND/OR algorithms have found widespread use. To contrast A*-like OR graph 
algorithms with two-player minimax algorithms, they are often referred to as 
single-agent (or one-player) search algorithms. 

With the advent of Nilsson’s AND/OR framework, two-agent search has 
been given a firm place within the larger field of heuristic search. Since AND/OR 
graphs subsume OR graphs, there is a satisfying conceptual unification of the two 
subfields. However, the impact of this unified view on the practice of research into 
heuristic search methods has been minor. The two subfields have continued to 
develop in parallel, with little interaction between them. One reason for the lack 
of coherence between the two communities could be the difference in objectives: 
a case can be made that winning chess tournaments requires a different mind 
set than optimizing industrial problems to increase revenue. 

This article makes the following contributions to our understanding of heuristic
search:

— Single-agent and two-agent search algorithms both traverse search graphs. 
The difference between the two algorithms is not in the graph, but in the 
semantics imposed by the application. Much of the research done in single- 
and two-agent search does not depend on the search algorithm, but on the 
search space properties. 

— Nilsson’s and Pearl’s dichotomy (the OR versus AND/OR choice)
is misleading. Heuristic search consists of identifying properties of the search 
space and implementing a number of search techniques that make effective 
use of these properties. There are many such properties, and the choice of 
backup rule (minimaxing in two-agent search; minimization in single-agent 
search) is but one. The implication of Nilsson’s and Pearl’s model is that the 
choice of backup rule is in some way fundamental; it is not. This paper argues 
for viewing heuristic search as the process in which properties of a search
space are specified. Once that has been done, the relevant search techniques 
(basic algorithm and enhancements) follow naturally. 

— Over the years researchers have uncovered an impressive array of search en- 
hancements that can have a dramatic effect on search efficiency. The typical 
scenario is that the idea is developed in one of the domains and possibly 
later reinvented in the other. In this paper we list search space properties 
under which many search enhancements are applicable, showing that the 
distinction between single- and two-agent search is not essential. By merg- 
ing the work done in these two areas, the commonalities and differences can 
be identified. This can be used to construct a generic search framework for 
designing high performance search algorithms. 

The message of this article is that single- and two-agent search can and should be 
considered as a single undivided field. It can, because the essence of search is en- 
hancements, not algorithms as is usually thought. It should, because researchers 
can benefit by taking advantage of work done in a related field, without rein- 
venting the technology, if they would only realize its applicability. Given all the 
similarities between the two areas, one has to ask the question: why is it so 
important to make a distinction based on the backup rule? 

This article is organized as follows: Section 2 discusses the importance of
search enhancements. Section 3 gives a taxonomy of properties of the search
space, which are matched up with the applicable search techniques in Section 4.
Section 5 draws some conclusions. The article is restricted to classical search;
algorithms such as simulated annealing and hill climbing are outside our scope. 

2 Algorithms vs Enhancements 

Most introductory texts on artificial intelligence start off explaining heuristic 
search by differentiating between different search strategies, such as depth-first, 
breadth-first, and best-first. Single-agent search is introduced, perhaps illus- 
trated with the 15-Puzzle. Another section is then devoted to two-player search 
algorithms. The minimax principle is explained, often followed by alpha-beta 
pruning. The focus in these texts is on explaining the basic search algorithms 
and possibly their fundamental differences (the backup rule and the decision as 
to which node to expand next). And that is where most AI books stop their 
technical discussion. 

In contrast, in real-world AI applications, it is the next step — the search 
enhancements — that is the topic of interest, not so much the basic algorithm. The 
algorithm decision is usually easily made. The choice of algorithm enhancements 
can have a dramatic effect on the efficiency of the search. Although it goes too 
far to say that the underlying algorithm is of no importance at all, it is fair 
to say that most research and development effort for new search methods and 
applications is spent with the enhancements. 

Some of the enhancements are based on application-specific properties; oth- 
ers work over a wide range of applications. Examples of application-dependent 
enhancements include the Manhattan distance for the sliding-tile puzzle, and
first searching moves that capture a piece before considering non-capture moves
in chess. Examples of application-independent enhancements are iterative deepening
and cycle detection.
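As an illustration of an application-dependent enhancement, the Manhattan distance for a sliding-tile puzzle can be sketched in a few lines of Python. The board encoding (a row-major tuple with 0 for the blank) and the name `manhattan_distance` are assumptions made for this example; they are not taken from any particular program.

```python
def manhattan_distance(state, goal, width):
    """Sum, over all tiles, of the grid distance between a tile's
    current square and its goal square. The blank (0) is not counted,
    which keeps the estimate a lower bound on the number of moves."""
    goal_position = {tile: divmod(index, width) for index, tile in enumerate(goal)}
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue
        row, col = divmod(index, width)
        goal_row, goal_col = goal_position[tile]
        total += abs(row - goal_row) + abs(col - goal_col)
    return total


GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)
# Tiles 7 and 8 are each one square away from home, so the estimate is 2.
print(manhattan_distance((1, 2, 3, 4, 5, 6, 0, 7, 8), GOAL, 3))
```

The function knows nothing about the search algorithm that calls it, which is why it counts as an enhancement tied to the application rather than to the algorithm.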

The performance gap between search algorithms with and without enhance- 
ments can be large. For example, something as simple as removing repeated 
states from the search can lead to large reductions in the search tree (e.g.,
using IDA* in sliding-tile puzzles, or using alpha-beta in chess). Combinations
of enhancements can lead to reductions of several orders of magnitude. 

In the traditional view, new applications are carefully analyzed until an ap- 
propriate algorithm and collection of algorithm enhancements is found that sat- 
isfies the user’s expectations. In this view, each problem has its own unique 
algorithmic solution; a rather segmented view. In reality, most search enhance- 
ments are small variations of general ideas. Their applicability depends on the 
properties of the search space, and the single/two-agent property is but a minor 
distinction that affects very few enhancements. It is the search enhancements
that tie single/two-agent search together, achieving the unity that Nilsson’s and
Pearl’s models strove for, albeit of a different kind.

3 Modeling Search 

Our thesis is that most search enhancements are independent of the single/two- 
agent distinction. This section identifies key properties of a search application 
that dictate the applicability of the search enhancements. The next section il- 
lustrates this point with some representative enhancements. 

Search program design consists of two parts. First, the problem solver must 
specify the properties of the state space. Second, based on this information, an 
appropriate implementation is chosen. Defining the properties of the state space 
includes not only the domain-specific constraints (graph and solution definition), 
but also constraints imposed by the problem solver (resources, search objectives, 
and domain knowledge). 

— Graph Definition: The problem definition allows one to construct a graph, 
where nodes represent states, and edges are state transition operators. This 
is typically just a translation of the transition rules to a more formal (graph) 
language. It provides the syntax of the state space. 

— Solution Definition: Goal nodes are defined and given their correct value. A 
rule for combining the values of a node’s successors to determine the value 
of the parent node is provided (such as minimization, or minimaxing) . This 
adds semantics to the state space graph. 

— Resource Constraints: Identify execution constraints that the search algorithm
must conform to.

— Search Objectives: The problem solver defines the goal of the search: an 
optimal or satisficing answer. This is usually influenced by the resource con- 
straints. 
— Domain Knowledge: Non-goal nodes may be assigned a heuristic value (such
as a lower bound estimator or an evaluation score). The properties of the 
evaluation function fundamentally influence the effectiveness of many search 
enhancements, typically causing many iterations of the design-and-test cycle. 

Once these properties are specified, the problem solver can design the application 
program. This is a three-step process.

1. Search Algorithm: The single/two-agent distinction is usually unambiguous,
and the algorithm selection is often trivial (although there exist a large
number of inventive, lesser-known alternatives).

2. Search Enhancements: The literature contains a host of search enhancements 
to exploit specific properties of the search space. The right combination can 
dramatically improve the efficiency of the basic algorithm. 

3. Implementation Choices: Given a search enhancement, the best implementa- 
tion is likely to be dependent on the application and the choice of heuristics. 
These considerations are outside the scope of this paper. 

Typically the choice of basic algorithm (single/two-agent) is easily made based 
on the problem definition. For most applications, the majority of the design 
effort involves judiciously fine-tuning the set of algorithm enhancements.
The applicability of search algorithm enhancements is determined by the five 
categories of properties of the state space. Figure 1 summarizes the interaction
between the state space properties (x axis) and step 2 of the algorithm design
process, the enhancements (y axis). A sampling of enhancements is illustrated
in the figure. The table shows how the search enhancements match up
with the properties. An “x” means that the state space property affects the 
effectiveness of the search enhancement. A “v” means that the search enhance- 
ment (favorably) affects a certain property of the search space. For example, the 
“v”s on the row for time constraints indicate that most search enhancements 
make the search go faster. Star entries mean that a search enhancement was 
specifically invented to attack a property. 

The five categories of search properties have been subdivided into individual 
properties. The following provides a brief description of these properties. 

3.1 Graph Definition 

The problem specification, the rules of the application, implicitly define a graph. 
Following established terminology, a problem space consists of states and transition
functions to go from one state to another. For example, in chess a state would be 
a board description (piece locations, castling rights, etc.). The transition function 
specifies the rules by which pieces move. In the traveling salesperson problem 
(TSP), a state can be a tour along all cities, or perhaps an incomplete tour. The 
transition function adds or replaces a city. 

The graph is treated as merely a formal representation of the problem, as 
yet devoid of meaning. It has not yet been decided what concepts like “payoff 
Fig. 1. Search Properties vs Enhancements



function” and “backup rule” mean. The problem graph is purely a syntactical 
description of the problem space. Semantics are added later. 

The graph has a number of interesting properties that can be exploited to 
improve the efficiency of the search. Of interest are the in-degree and out-degree
(branching factor) of nodes, the size of the graph, and whether the
graph contains cycles. These properties are self-explanatory. 

3.2 Solution Definition 

In this part of the problem solving process meaning is attached to some of the 
states. If the graph definition provides us with a syntactic description of the 
problem, then the solution definition associates semantics to the graph. The 
meaning, or value, of certain states in the graph is defined by the application 
rules. For example, in chess all checkmate states have a known value. In the TSP, 
a tour that visits all cities and ends in the original one is a possible solution. 
The objective of the search is to find these goal or solution states, and to report 
back how they can be reached. Solutions are a subset of the search space, and 




Unifying Single-Agent and Two-Player Search 






this space can be defined by the solution density, solution depth, and the backup 
rule for solution states. 

Solution Density. The distribution of solution states determines how hard 
searching for them will be. When there are many solution states it will be easier 
to find one, although determining whether it is a least cost solution (or some 
other optimality constraint) may be harder. 

Solution Depth. An important element of how solution states are dis- 
tributed in the search space is the depth at which they occur (the root of the 
graph is at depth 0). Search enhancements may take advantage of a particular 
distribution. For example, breadth-first search may be advantageous when there 
is a high variability in the depth to solution. 

Solution Backup Rule. The problem description defines how solution values
should be propagated back to the root. Two-agent games use a minimax
rule; optimization problems use minimization or maximization.
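To make the role of the backup rule concrete, the following Python sketch traverses one and the same tree twice, changing only the rule that combines successor values. The nested-list tree encoding and the names `solve`, `minimize`, and `minimax` are assumptions made for illustration only.

```python
def solve(node, backup, depth=0):
    """Depth-first traversal of an explicit tree.

    Leaves are numbers (solution values); interior nodes are lists of
    children. `backup(depth, values)` combines the children's values."""
    if not isinstance(node, list):
        return node
    child_values = [solve(child, backup, depth + 1) for child in node]
    return backup(depth, child_values)


def minimize(depth, values):
    """Single-agent backup rule: always take the cheapest successor."""
    return min(values)


def minimax(depth, values):
    """Two-agent backup rule: maximize at even depths, minimize at odd ones."""
    return max(values) if depth % 2 == 0 else min(values)


TREE = [[3, 5], [2, 9], [4, 6]]
print(solve(TREE, minimize))  # 2
print(solve(TREE, minimax))   # 4
```

The traversal code is shared; only the semantics attached to the graph differ, which is the sense in which the backup rule is one property among many.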



3.3 Resource Constraints 

Resource constraints (space and time) play a critical role in determining which 
enhancements are feasible. 



3.4 Search Objective 

One of the most important decisions to be taken is the objective of the search. 
This decision is influenced by the size of the problem graph, solution density 
and depth, and resource constraints. It is closely related to the classical choice: 
optimize or satisfice. The choice of search objective defines a global stop
condition.

Optimization. Optimization involves finding the best (optimal) value for 
the search problem. Given a problem graph, the properties that determine wheth- 
er optimization is feasible are solution density and depth. 

Satisficing. Sometimes optimization is too expensive and one needs real- 
time or anytime algorithms. In this case, a payoff, or evaluation function, is 
applied to a set of states that lie closer to the root of the graph. The evaluation 
function is a heuristic approximation of the true value of the state. The search 
progresses, trying to find the best approximation to the true solution, subject 
to the available resources. 



3.5 Domain Knowledge 

The heuristic evaluation function encodes application-dependent domain 
knowledge about the search. Typically, it is the most important component of a 
search application. Unfortunately, it has to be redeveloped anew for each prob- 
lem domain. Since the heuristic function is application dependent, most of its 
internals cannot be discussed in a general way. The external characteristics, 
however, can. 
There are many different types of information that can be returned by a
heuristic evaluation. Some examples include: lower/upper bound estimates on 
the distance to solution, point estimates on the quality of a state, ranges of 
values, and probability distributions. 

The most important aspect of the heuristic evaluation function is the differ- 
ence between the heuristic value h and the true value for a state. In general, the 
better the quality of h, the more efficient the search. Related to the quality of 
the heuristics are parent/child correlation of state (how much the state
changes by a state transition), parent/child correlation of value (how similar
the value is between a parent and child node), and the granularity [27] of
the heuristic function (the coarseness of the values; finer granularity generally
implies more search effort). 

The search algorithm together with heuristic information is used to decide 
on the next node to expand in the search. For some applications, the decision 
may be mechanical, such as depth-first, breadth-first or best-first, but heuristic 
information can be instrumental in ordering nodes from most- to least-likely to 
succeed. 



4 Search Enhancements 

This section classifies various search enhancements used. The enhancements have 
been grouped into classes, of which a few of the more interesting ones are discussed
(the ones illustrated in Figure 1). For each class, a representative technique
is given and its applicability to single- and two-agent search is discussed.
The material is intended to be an illustrative sample (because of space con- 
straints), not exhaustive. Since in most cases the preconditions necessary for us- 
ing an enhancement are not tied to any fundamental property of an application, 
the search enhancements presented are applicable to a wide class of applications. 



4.1 State Space Techniques 

These techniques depend only on the application definition and are therefore 
independent of the algorithm selected. 

Path Transposition and Cycle Detection 

Precondition: In-degree is > 1. Two search paths can lead to the same state. Idea:
Repeated states encountered in the search need only be searched once. Search
efficiency can (potentially) be improved dramatically by removing these redundant
states. Advantages: Reduces the search tree size. Disadvantages: Increases
the cost per node and/or storage required. Techniques: Two-agent: the typical
technique is to store positions in a hash table to allow for rapid determination
if a state has been previously seen. Single-agent: in addition to hash tables,
finite state machines have been used to detect cycles.
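The hash-table technique can be sketched on a toy domain in which many paths reach the same state. In the Python example below, the grid domain, the `step_cost` function, and all names are hypothetical; a dictionary plays the role of the transposition table.

```python
def step_cost(x, y):
    """Hypothetical cost of entering grid cell (x, y)."""
    return (3 * x + 5 * y) % 7 + 1


def min_cost(x, y, size, table, stats):
    """Cheapest cost from (x, y) to (size, size), moving right or down.

    If `table` is a dict it is used as a transposition table: a state
    reached along a second path is looked up instead of searched again."""
    if table is not None and (x, y) in table:
        return table[(x, y)]
    stats["expanded"] += 1
    if x == size and y == size:
        value = 0
    else:
        options = []
        if x < size:
            options.append(step_cost(x + 1, y) + min_cost(x + 1, y, size, table, stats))
        if y < size:
            options.append(step_cost(x, y + 1) + min_cost(x, y + 1, size, table, stats))
        value = min(options)
    if table is not None:
        table[(x, y)] = value
    return value


def cheapest_path(size, use_table):
    """Return (optimal cost, number of expanded nodes)."""
    stats = {"expanded": 0}
    table = {} if use_table else None
    return min_cost(0, 0, size, table, stats), stats["expanded"]


print(cheapest_path(8, use_table=False))
print(cheapest_path(8, use_table=True))
```

Both calls return the same cost, but with the table each of the 81 grid states is expanded exactly once, at the price of one dictionary lookup per node and storage for every visited state.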
4.2 State- and Solution-Space Interaction

These enhancements depend on the state space graph and on the definition of 
the solution space. 

State Space Enumeration 

Precondition: Size of the state space graph and/or solution search tree must be “small.”
Idea: If the state space is small enough, then the optimal answer can be com- 
puted. For some applications, traversal of the entire state space may not be 
necessary; one need only traverse the solution tree, ignoring parts of the state 
space that can logically be proven irrelevant. Advantages: Optimal answer for 
some/all nodes in the state space. Disadvantages: May require large amounts of 
time and space to traverse the state space and save the results. Techniques: Several
games and puzzles with large state spaces have been solved by enumeration,
including Nine Men’s Morris [7], Qubic, Go-Moku [1], and the 8-Puzzle and
12-Puzzle.
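For the 8-Puzzle the enumeration idea fits in a short Python sketch: a breadth-first traversal outward from the goal records the optimal distance of every reachable state. The tuple encoding and the function names are assumptions made for this example, and the sketch relies on every move being reversible.

```python
from collections import deque


def neighbors(state):
    """Yield the states reachable by sliding one tile into the blank (0)."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < 3 and 0 <= c < 3:
            swap = r * 3 + c
            cells = list(state)
            cells[blank], cells[swap] = cells[swap], cells[blank]
            yield tuple(cells)


def enumerate_distances(goal):
    """Breadth-first enumeration from the goal.

    Because moves are reversible, the depth at which a state is first
    reached equals its optimal solution length."""
    distance = {goal: 0}
    frontier = deque([goal])
    while frontier:
        state = frontier.popleft()
        for nxt in neighbors(state):
            if nxt not in distance:
                distance[nxt] = distance[state] + 1
                frontier.append(nxt)
    return distance


GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)
table = enumerate_distances(GOAL)
print(len(table))  # 181440 states, half of the 9! permutations
```

The resulting table answers any solvable instance with a single lookup, which illustrates both the advantage (optimal answers for all states) and the disadvantage (time and memory proportional to the size of the state space).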



4.3 Successor Ordering Techniques 

The order in which the successors of an interior node are visited may affect the
efficiency of the search. For example, in the alpha-beta algorithm, searching the 
best move first achieves the maximal number of cutoffs. In single-agent search, 
searching the best move first allows one to find the solution sooner. These en- 
hancements depend on one property of the application: whether the order of 
considering branches influences when a cutoff occurs. 

There are many techniques for doing this in the literature, including previous
best move ordering and the history heuristic [23]. Both ideas have been
tried in single- and two-agent applications (although the benefits in optimization
seem to be necessarily small).
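The effect of successor ordering on alpha-beta can be observed with a small Python experiment. The sketch below is not an implementation of the history heuristic: as a stand-in it uses an unrealistic oracle that sorts successors by their true minimax value, so it shows only the best case that practical ordering heuristics try to approach. The random tree and all names are assumptions for illustration.

```python
import random


def build_tree(depth, branching, rng):
    """Uniform tree as nested lists with random integer leaves."""
    if depth == 0:
        return rng.randint(0, 99)
    return [build_tree(depth - 1, branching, rng) for _ in range(branching)]


def minimax(node, maximizing):
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def alphabeta(node, alpha, beta, maximizing, order_moves, stats):
    if not isinstance(node, list):
        stats["leaves"] += 1
        return node
    children = node
    if order_moves:
        # Oracle ordering (an assumption of this sketch): best successor first.
        children = sorted(node, key=lambda c: minimax(c, not maximizing),
                          reverse=maximizing)
    best = float("-inf") if maximizing else float("inf")
    for child in children:
        value = alphabeta(child, alpha, beta, not maximizing, order_moves, stats)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break  # cutoff: the remaining successors cannot matter
    return best


def run(tree, order_moves):
    """Return (root value, number of leaves evaluated)."""
    stats = {"leaves": 0}
    value = alphabeta(tree, float("-inf"), float("inf"), True, order_moves, stats)
    return value, stats["leaves"]


tree = build_tree(4, 4, random.Random(0))
print(run(tree, order_moves=False))
print(run(tree, order_moves=True))
```

Both runs return the same root value. With best-first ordering, the search of this depth-4, branching-factor-4 tree evaluates 31 of its 256 leaves, the minimum needed to establish the minimax value of a uniform tree of that shape.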



4.4 Repeatedly Visiting States 

One of the major search results to come out of the work on computer chess was 
that repeatedly visiting a state, although seemingly wasteful, may actually prove 
to be beneficial. The effectiveness of this enhancement depends ultimately on the 
heuristic evaluation function, although it works for a large class of applications. 

Iterative Deepening 

Precondition: Information from a shallow search satisfying condition d must provide
some useful information for a deeper search satisfying d + Δ. Idea: Search
down a path until a condition d is met. After the entire tree has been searched
with condition d, and no solution has been found, repeat a deeper search to
satisfy condition d + Δ. Advantages: For two-agent search, the main advantages
are move ordering and time management for real-time search. For single-agent
search, the benefit is reduced space overhead. Disadvantages: Repeated visitations cost
time. The value of the information gathered must outweigh the cost of collecting
it. Techniques: In many two-agent applications, the search iterates on the search
depth. Move ordering is critical to the efficiency of alpha-beta search. By storing
the best moves of each searched node, in each iteration the move ordering
of another level of the search tree is improved [24]. In single-agent search,
iterative deepening is used to refine the upper (lower) bound on the value being
minimized (maximized). It is primarily used because it reduces the space
requirements of the application [14].
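The single-agent use of iterative deepening can be sketched as a minimal IDA* for the 8-Puzzle in Python. Here the condition d is a bound on the estimated solution cost, and each iteration raises the bound to the smallest estimate that exceeded it. The state encoding, the Manhattan-distance heuristic, and all names are assumptions of this example, and the start state is assumed to be solvable.

```python
import math

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)
FOUND = object()  # sentinel returned when the goal has been reached


def heuristic(state):
    """Manhattan distance to GOAL; tile t belongs on square t - 1."""
    total = 0
    for index, tile in enumerate(state):
        if tile:
            row, col = divmod(index, 3)
            goal_row, goal_col = divmod(tile - 1, 3)
            total += abs(row - goal_row) + abs(col - goal_col)
    return total


def neighbors(state):
    blank = state.index(0)
    row, col = divmod(blank, 3)
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < 3 and 0 <= c < 3:
            cells = list(state)
            cells[blank], cells[r * 3 + c] = cells[r * 3 + c], cells[blank]
            yield tuple(cells)


def bounded_search(path, g, bound):
    """Depth-first search that stops when g + h exceeds the bound.

    Returns FOUND, or the smallest estimate that exceeded the bound."""
    state = path[-1]
    estimate = g + heuristic(state)
    if estimate > bound:
        return estimate
    if state == GOAL:
        return FOUND
    smallest = math.inf
    for nxt in neighbors(state):
        if nxt in path:  # cycle detection along the current path
            continue
        path.append(nxt)
        result = bounded_search(path, g + 1, bound)
        if result is FOUND:
            return FOUND
        path.pop()
        smallest = min(smallest, result)
    return smallest


def ida_star(start):
    """Return the optimal number of moves from start to GOAL."""
    bound = heuristic(start)
    path = [start]
    while True:
        result = bounded_search(path, 0, bound)
        if result is FOUND:
            return len(path) - 1
        if result == math.inf:
            return None
        bound = result  # deepen: d + delta is the smallest exceeded estimate


print(ida_star((0, 1, 3, 4, 2, 5, 7, 8, 6)))  # 4
```

Only the current path is stored, so memory grows with the solution depth rather than with the number of states visited; the price is that shallow parts of the tree are searched again in every iteration.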

4.5 Off-Line Computations 

It is becoming increasingly possible to precompute and store large amounts of 
interesting data about the search space that can be used dynamically at runtime. 

Solution Databases 

Precondition: One must be able to identify goal nodes in the search (trivial). 
Idea: The databases define a perimeter around the goal nodes. In effect, the 
database increases the set of goal nodes. Advantages: The search can stop when 
it reaches the database perimeter. Disadvantages: The databases may be costly 
to compute. Furthermore, the memory hierarchy makes random access to tables 
increasingly costly as their size grows. Techniques: In two-agent search, solution 
(or endgame) databases have been built for a number of games, in some cases 
resulting in dramatic improvements in the search efficiency and in the quality 
of search result. In single-agent applications solution databases have been tried 
in the 15-Puzzle. An on-line version of this idea exists, dynamically building the
databases at runtime (bi-directional or perimeter search [16]).
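The perimeter idea can be sketched on a deliberately tiny, hypothetical domain: states are positive integers, a move either adds 1 or doubles the number, and the task is to reach a goal number in as few moves as possible. In the Python example below, a backward breadth-first search builds the database, and the forward search stops as soon as it touches the perimeter. The domain and all names are assumptions for illustration.

```python
from collections import deque


def successors(n):
    """Forward moves of the toy domain: add one, or double."""
    return (n + 1, 2 * n)


def predecessors(n):
    """States from which n is reachable in one move."""
    result = []
    if n > 1:
        result.append(n - 1)
    if n % 2 == 0:
        result.append(n // 2)
    return result


def build_database(goal, radius):
    """Backward breadth-first search: exact distance to the goal for
    every state within `radius` moves of it."""
    database = {goal: 0}
    frontier = deque([goal])
    while frontier:
        state = frontier.popleft()
        if database[state] == radius:
            continue
        for prev in predecessors(state):
            if prev not in database:
                database[prev] = database[state] + 1
                frontier.append(prev)
    return database


def search(start, goal, database):
    """Forward breadth-first search that stops at the database perimeter.

    Returns (solution length, number of states expanded)."""
    depth = {start: 0}
    frontier = deque([start])
    expanded = 0
    while frontier:
        state = frontier.popleft()
        if state in database:
            return depth[state] + database[state], expanded
        expanded += 1
        for nxt in successors(state):
            # Moves only increase the number, so overshooting is pointless.
            if nxt <= goal and nxt not in depth:
                depth[nxt] = depth[state] + 1
                frontier.append(nxt)
    return None, expanded


for radius in (0, 3):
    print(radius, search(3, 100, build_database(100, radius)))
```

With radius 0 the database holds only the goal itself and the forward search must travel the full six moves; with radius 3 it stops three moves earlier and reads the remaining distance from the database, so it expands no more states than before.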

4.6 Search Effort Distribution 

The simplest search approach is to allocate equal effort (search depth) to all chil- 
dren of the root. Often there is application-dependent knowledge that allows the 
search to make a more-informed distribution of effort. Promising states can be 
allocated more effort, while less promising states would receive less. (Essentially, 
this enhancement can be regarded as a generalization of successor ordering.) In 
satisficing single-agent search this idea is used to concentrate the search effort 
on promising branches. For optimizing single-agent search, it is of limited value 
since even if an extended search, for example, finds a solution, all possible non-extended
nodes must still be checked for a better solution. It is also beneficial for
real-time single-agent search such as RTA* [15] and other anytime algorithms.
In two-agent search it is used in forward pruning or selective search. Popular
ideas used in two-agent search include singular extensions, the null move
heuristic, and ProbCut.

5 Conclusion 

For decades researchers in the fields of single- and two-agent heuristic search 
have developed enhancements to the basic graph traversal algorithms. Histori- 
cally the fields have developed these enhancements separately. Nilsson and Pearl 
popularized the AND/OR framework, which provided a unified formal basis, but
also stressed the difference between OR and AND/OR algorithms. The fields 
continued their relatively separate development. 

This paper advances the view that the essence of heuristic search is not 
searching either single- or two-agent graphs, but which search enhancements 
one uses. First, the single/two-agent property is but one of the many properties 
of the search space that play a role in the design process of a high performance 
heuristic search application. Second, the single/two-agent distinction is not the 
dominant factor in the design and implementation of a high-performance search 
application — search enhancements are. Third, most search enhancements are 
quite general; they can be used for many different applications, regardless of 
whether they are single- or two-agent. 

The benefit of recognizing the crucial role played by search techniques is 
immediate: application developers will have a larger suite of search enhancements 
at their disposal; ideas first conceived of in two-agent search will not have to be 
rediscovered later independently for single-agent search, and vice versa. In an 
implementation the best combination of techniques depends on the expected 
search benefits versus the programming efforts, not on the single- or two-agent 
algorithm. 

For twenty years, most of the research community has (explicitly and implic- 
itly) treated single- and two-agent search as two different topics. Now it is time to 
take stock and recognize the pivotal role that search enhancements have come to 
play: the algorithm distinction is minor, and most research and implementation 
efforts are directed towards the enhancements. All the properties of the search 
space — not just the single/two-agent distinction — play their role in determining 
the effectiveness of what heuristic search is all about: enhancing the basic
search algorithms to achieve high performance. 
Abstract. Traditionally, chess has been called the “fruit fly of Artificial
Intelligence”. In recent years, however, the focus of game playing
research has gradually shifted away from chess towards games that offer 
new challenges. These challenges include large branching factors and im- 
perfect information. The game of Hex offers some interesting properties 
that make it an attractive research subject. This paper presents the key 
ideas behind Queenbee, the first Hex playing program to play at the level 
of strong human players. 

Keywords: Heuristic search, game playing, alpha-beta, Hex. 



1 Introduction 

Game playing has often been described as an ideal test bed for Artificial In- 
telligence research. Traditionally, game playing research has focussed mainly 
on chess. Several decades of research have produced some powerful techniques, 
mostly geared at the efficient traversal of large game trees. It has also produced 
some notable triumphs; humans have been surpassed by programs in games 
such as checkers, Scrabble, and Othello, and other games such as Go-Moku, 
Connect-4, and Nine Men’s Morris have even been solved. 

With the advent of the checkers world champion program Chinook and the 
chess program Deep Blue, researchers started to realize that the techniques that 
drive these programs had been all but stretched to their limits. Yet there are 
other classes of games for which these methods would be of little use in construct- 
ing a program that can play on par with the strongest humans. One such class 
includes games in which not all of the information is available to each player, 
such as in card games like bridge and poker. Another class is the one containing 
games whose branching factor, defined as the typical number of available options 
for a player when it is time to make a move, is too large to make brute force tree 
search algorithms feasible. The best known of these games is the Oriental 
board game Go, for which no strong programs exist despite the considerable effort 
and expertise that have been devoted to it. 

Another member of the class of high branching factor games is Hex. It is 
similar to Go, but may be a better choice for study due to the simplicity of 
the goal and the rules. The game has several interesting properties. It can be 
played on a board of any size, thus becoming arbitrarily complex in terms of the 
branching factor. The rules are simple, yet they give rise to elaborate strategic 
ideas. Hex has a rich underlying mathematical structure that ties in to advanced 
concepts in graph theory and topology. It has the interesting properties that 
games cannot end in a draw¹ and that it can be proved that the game is a first 
player win with perfect play. However, the winning strategy is not known. 

Queenbee is a Hex playing program that is based on a novel idea for an 
evaluation function. It is the first Hex program to surpass “novice” level in 
human terms. Indeed, it now plays at the level of very strong human players, if 
not quite yet at the level of the top players. Queenbee has also carried out the 
first complete analysis of all opening lines on a 6 x 6 board. The program has 
its own web page, http://www.cs.ualberta.ca/~queenbee, which includes the 
6 x 6 opening analysis. 

This paper is organized as follows. Section 2 introduces the game of Hex and 
its properties. The next two sections describe Queenbee’s evaluation function 
and its search algorithms, respectively. An assessment of Queenbee’s playing 
strength is given in Section 5. Section 6 mentions current work in progress and 
future work, followed by a summary and conclusions. 

2 Hex 

Hex was invented by Danish engineer, poet, and mathematician Piet Hein (1905- 
1996) in 1942, and independently rediscovered by John Nash (1928-) in 1948. 
It is a board game with simple rules, but a complex strategy. Indeed, winning 
strategies are only known for board sizes up to 7 x 7, whereas the game is 
commonly played on sizes of 10 x 10 or larger. The game is a special case of a 
more general graph colouring game known as the Shannon switching game, which 
was proved to be PSPACE-complete [ET76]. This section describes the rules of 
Hex, as well as some special properties of the game. 



2.1 Rules 

Hex is played on a rhombic hexagonal pattern, as in Figure 1. This particular 
Hex board has 5 x 5 cells, but the game can be played on boards of any size. 
The board has two white borders and two black borders, indicated in the figure 
by rows of discs placed next to the borders. It is often helpful to imagine the 
presence of these “ghost edge pieces”. Note that the four corner cells each belong 
to two borders. 

Play proceeds as follows. The two players, henceforth to be called White and 
Black, take turns placing a piece of their colour on an empty cell. There is no 
standard convention on which colour gets the first move. White wins the game 
by connecting the two white borders with a chain of white pieces, while Black 



¹ This is directly equivalent to a fundamental theorem of topology; see [Gal86]. 

Fig. 1. A 5 x 5 Hex board with a winning chain for Black 




Fig. 2. A two-bridge 



wins by establishing a chain of black pieces connecting the two black borders. 
In Figure 1, Black has completed a winning chain. 

A detailed discussion of Hex strategy is beyond the scope of this paper. 
However, it is necessary at this point to introduce its most basic element: the 
“two-bridge”. Figure 2 depicts two black pieces; the two cells marked ‘x’ are 
still empty. Black can connect the two pieces even if White plays first; whenever 
White plays in one of the cells marked ‘x’, Black occupies the other. This ensures 
what is known as a “virtual connection” between the two black pieces. It is the 
simplest example of a virtual connection guaranteed by two disjoint threats. 



2.2 Properties 

One of the properties of Hex is that the game can never end in a draw. The proof 
of this intuitively obvious fact is based on the observation that a Hex board that 
is completely filled with pieces must necessarily contain a winning chain for one 
of the two players.² Another interesting property, first noted by John Nash, is 
that Hex can be proved to be a theoretical win for the first player. This proof 
is based on the “strategy stealing argument”: if there were a winning strategy S 
for the second player, then the first player could play an arbitrary opening move 
and subsequently apply S to win. The arbitrary opening move cannot spoil this 
strategy because an extra piece can never be a disadvantage in Hex.³ Since we 
cannot have both players winning the game, there cannot be such a strategy, 
and since draws are impossible, it follows that there is a winning strategy for 
the first player. 

² See [Gal86]. 

3 Evaluation Function 

A game playing program needs a good evaluation function to help guide the 
search. It is not immediately obvious how to construct a meaningful evaluation 
function for Hex. For example, unlike in many other board games, the concepts 
of material balance and mobility are not useful in Hex. This section contains 
new ideas for a Hex evaluation function, which are implemented in Queenbee. 
The function calculates the distance to each edge of all the unoccupied cells 
on the board, according to an unconventional metric called “two-distance”. The 
resulting distances are referred to as “potentials”. 



3.1 Two-Distance 

Given a graph F with an adjacency function n(p) that maps a vertex p onto 
the set of the vertices that are adjacent to it, the conventional distance metric 
is actually a special case of a more general distance metric d: 

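\[
d(p,q) =
\begin{cases}
0 & \text{if } q = p, \\
1 & \text{if } q \in n(p), \\
\min \{\, k : c_k(p) \ge z \,\} & \text{otherwise,}
\end{cases}
\]

where

\[
c_k(p) = \bigl|\{\, r \in n(p) \mid d(r,q) < k \,\}\bigr|.
\]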

The conventional distance metric corresponds to z = 1, in which case the distance 
of a cell to an edge on the Hex board represents the number of “free moves” that 
it would take for a player to connect the cell to the given edge. Unfortunately this 
distance function is not very useful for building an evaluation function for Hex, 
as will be shown later. Rather, the concept of two-distance is used, where z = 2. 
The two-distance is one more than the second lowest distance of p’s neighbours 
to q, with the proviso that the two-distance equals 1 if p and q are directly 
adjacent. The intuition behind the two-distance idea is that, when playing an 
adversarial game, one can always choose to force the opponent to take the second 
best alternative by blocking the best one. The two-distance captures this concept 
of “the best second-best alternative”. 
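
The definition can be made concrete with a short sketch. The Python function below is an illustration only, not Queenbee's code: the name `z_distance`, the dictionary representation of the graph, and the level-by-level evaluation are assumptions made for this example. It assumes an undirected graph and returns d(p, q) for every vertex p, with unreachable vertices reported as infinity.

```python
import math

def z_distance(adj, q, z=2):
    """Return {p: d(p, q)} for the generalised metric with parameter z.

    adj maps each vertex to the list of its adjacent vertices (undirected).
    """
    dist = {q: 0}
    for p in adj[q]:
        dist[p] = 1                      # directly adjacent to the target
    k = 2
    while True:
        # p gets distance k once at least z of its neighbours are closer than k
        newly = [p for p in adj
                 if p not in dist
                 and sum(1 for r in adj[p] if r in dist and dist[r] < k) >= z]
        if not newly:
            break                        # no vertex at level k, so none later
        for p in newly:
            dist[p] = k
        k += 1
    return {p: dist.get(p, math.inf) for p in adj}

# Small hypothetical graph: c has two routes to q, d hangs off c alone.
example = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
conventional = z_distance(example, "q", z=1)
two_dist = z_distance(example, "q", z=2)
```

With z = 1 the vertex `d` is three steps from `q`. With z = 2 it is unreachable, because its only route runs through a single neighbour that an opponent could block.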

There is an important distinction between adjacency and neighbourhood. 
Adjacency implies neighbourhood, but not vice versa. Two cells are adjacent if 
they share a common edge on the board. The notion of neighbourhood takes 
into account any black and white pieces that are already on the board. Two 

³ For this reason, the argument will not work for games such as chess. 



Are Bees Better than Fruitflies? 






unoccupied cells⁴ are neighbours from White’s point of view if either they are 
adjacent or there is a string of white stones connecting them. Note that a cell’s 
neighbourhood can therefore be different from White’s point of view than it is 
from Black’s point of view. These two neighbourhoods will be referred to as 
the W-neighbourhood and the B-neighbourhood. Correspondingly, there will be a 
distinction between W-distance and B-distance. 
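
The difference between adjacency and neighbourhood can be sketched in a few lines. The code below is a hedged illustration, not the program's implementation: the board is assumed to be a dictionary that maps occupied `(row, col)` cells to `"W"` or `"B"`, the function names are hypothetical, and the borders (ghost edge pieces) are left out to keep the example short.

```python
def hex_adjacent(cell, size):
    """Cells sharing an edge with `cell` on a size x size rhombic Hex board."""
    r, c = cell
    steps = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, -1), (-1, 1)]
    return [(r + dr, c + dc) for dr, dc in steps
            if 0 <= r + dr < size and 0 <= c + dc < size]

def neighbourhood(board, size, cell, colour):
    """Empty cells that are neighbours of the empty `cell` as seen by `colour`."""
    found, seen, stack = set(), {cell}, [cell]
    while stack:
        for nxt in hex_adjacent(stack.pop(), size):
            if nxt in seen:
                continue
            seen.add(nxt)
            if board.get(nxt) == colour:
                stack.append(nxt)        # keep walking along the own-colour chain
            elif nxt not in board:
                found.add(nxt)           # empty cell: a neighbour
            # opponent pieces are neither neighbours nor links
    return found

# One white piece in the centre of a 3 x 3 board.
board = {(1, 1): "W"}
w_view = neighbourhood(board, 3, (1, 0), "W")
b_view = neighbourhood(board, 3, (1, 0), "B")
```

From White's point of view the piece on (1, 1) links cell (1, 0) to every empty cell around that piece, so `w_view` is larger than `b_view`, which contains only the adjacent empty cells.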





I. W-distance to lower white edge 



II. B-distance to upper black edge 



Fig. 3. Two-distances on a non-empty board 



The two-distances can be computed using the standard Dijkstra algorithm for 
calculating distances in graphs, modified appropriately for z = 2. An example of 
two-distances is shown in Figure 3. Consider the cell containing the underlined 
3. Its two-distance to the lower white edge is not 2, because it only has one 
neighbour that is at two-distance less than 2. Note also that the ghost edge 
pieces must be taken into account when calculating the distances. This explains 
why the rightmost cell in Figure 3.I is at two-distance 6. Due to the ghost edge 
pieces, all the cells along the upper white edge are its W-neighbours, and at least 
two of those are at two-distance less than 6. 
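
A possible way to carry out this computation is sketched below. It is not the authors' code and makes several assumptions that are not in the text: White's borders are taken to be the top and bottom rows and are modelled as two virtual nodes `TOP` and `BOTTOM`, the board is a dictionary of occupied cells, and a simple level-by-level loop stands in for the priority-queue form of Dijkstra's algorithm. The non-target border acts as a chain of ghost edge pieces, so all empty cells along it are W-neighbours of each other.

```python
import math

TOP, BOTTOM = "TOP", "BOTTOM"      # White's two borders as virtual nodes (assumption)
STEPS = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, -1), (-1, 1)]

def adjacent(node, size):
    """Nodes sharing an edge with `node`; a border node touches a whole row."""
    if node == TOP:
        return [(0, c) for c in range(size)]
    if node == BOTTOM:
        return [(size - 1, c) for c in range(size)]
    r, c = node
    out = [(r + dr, c + dc) for dr, dc in STEPS
           if 0 <= r + dr < size and 0 <= c + dc < size]
    if r == 0:
        out.append(TOP)
    if r == size - 1:
        out.append(BOTTOM)
    return out

def w_neighbours(board, size, cell, target):
    """W-neighbours of an empty cell; contains `target` if the border is reached."""
    found, seen, stack = set(), {cell}, [cell]
    while stack:
        for nxt in adjacent(stack.pop(), size):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt == target:
                found.add(nxt)
            elif nxt in (TOP, BOTTOM) or board.get(nxt) == "W":
                stack.append(nxt)        # ghost edge pieces or a white chain
            elif nxt not in board:
                found.add(nxt)           # empty cell
    return found

def w_two_distance(board, size, target=BOTTOM):
    """Two-distance (z = 2) of every empty cell to one white border."""
    empties = [(r, c) for r in range(size) for c in range(size)
               if (r, c) not in board]
    nbrs = {cell: w_neighbours(board, size, cell, target) for cell in empties}
    dist = {cell: 1 for cell in empties if target in nbrs[cell]}
    k = 2
    while True:
        newly = [cell for cell in empties if cell not in dist
                 and sum(1 for n in nbrs[cell] if n in dist) >= 2]
        if not newly:
            break
        for cell in newly:
            dist[cell] = k
        k += 1
    return {cell: dist.get(cell, math.inf) for cell in empties}

empty_3x3 = w_two_distance({}, 3)
```

On the empty 3 x 3 board the corner cell (0, 0) has only two adjacent cells on the board, but the ghost edge pieces make (0, 2) one of its W-neighbours, which is what gives it a finite two-distance.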

3.2 Potentials 

The goal in Hex is to connect two sides of the board. To help achieve this, one 
might look for an unoccupied cell that is as close as possible to being connected 
to both sides, as this would be a promising candidate for being part of a winning 
chain. The evaluation function calculates potentials that capture this concept. 
Each unoccupied cell is assigned two potentials, based on the two-distance 
metric. A cell’s W-potential is defined as the sum of its W-distances to both white 
edges; its B-potential is the sum of its B-distances to both black edges. 

Cells with low W-potentials are the ones that are closest to being connected 
to both white borders by White. If White can connect a cell to both white 
borders, this would establish a winning chain. White will therefore focus on those 
cells that have the lowest W-potentials. The white board potential is defined as 

⁴ Neighbourhood is only ever used for empty cells. 

Fig. 4. Cell potentials 



the lowest W-potential that occurs on the board. In the example of Figure 4, the 
white board potential is 4, and the black board potential is 5. As lower potentials 
are better, it appears that White is ahead. 

In the same figure, it can be seen that both Black and White have only 
one cell that actually realizes their board potential. It would be better to have 
more than one cell that realizes the board potential, so as to have more attack 
options and be less vulnerable to the opponent blocking the chain. This is used 
in Queenbee’s evaluation function to break ties between positions with equal 
board potentials. 
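
The quantities defined in this section reduce to a sum and a minimum, as the following sketch shows. The function names and the distance values are hypothetical and chosen only for illustration; how the tie-break count is folded into the final evaluation is not specified here, so the sketch simply returns both numbers.

```python
import math

def cell_potentials(dist_to_edge_a, dist_to_edge_b):
    """Potential of each empty cell: sum of its two-distances to both edges."""
    return {cell: dist_to_edge_a[cell] + dist_to_edge_b[cell]
            for cell in dist_to_edge_a}

def board_potential(potentials):
    """Return (lowest potential, number of cells that realise it)."""
    best = min(potentials.values(), default=math.inf)
    realising = sum(1 for p in potentials.values() if p == best)
    return best, realising

# Hypothetical W-distances for three empty cells.
to_upper = {"a": 1, "b": 2, "c": 3}
to_lower = {"a": 3, "b": 2, "c": 2}
w_potentials = cell_potentials(to_upper, to_lower)   # {"a": 4, "b": 4, "c": 5}
print(board_potential(w_potentials))                 # (4, 2)
```

Here two cells realise the board potential of 4, which would be preferred over a position with the same board potential realised by a single cell.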


3.3 Strategic Relevance 

The idea behind the two-distance metric is directly related to the importance of 
double threats. Indeed the two-distance implicitly takes into account the 
two-bridges that occur in a Hex position. Consider the position in Figure 5. The 
Black distance to each edge cannot percolate through the White two-bridges. 
As White already has a winning connection made up of two-bridges, the result 
is that the Black board potential is infinite. Thus, the two-distance metric also 
implicitly recognizes a winning chain that consists of virtual connections through 
two-bridges, even if the chain is not actually solidified yet. 

By contrast, using the conventional z = 1 distance metric in the position of 
Figure 5 would yield board potentials of 4 for both White and Black, suggesting 
that both players are equally close to establishing a winning connection. It is 
clear that the two-distance metric is far more suited to Hex than the conventional 
distance metric is. 



3.4 Cell Potentials 

As mentioned before, White will want to play in cells that have a low W-potential, 
as those are the cells that are closest to being connected to both white 
edges. Simultaneously, White will also want to focus on cells that have low 
B-potentials. Those are the cells where Black is closest to establishing a winning 



B-distance to lower black edge 



Fig. 5. The two-distance cannot percolate through two-bridges 



connection, and therefore White will want to play in those cells to block Black’s 
connection. Combining this, White will prefer to play in cells that have a low 
total potential, where the total potential of a cell is the sum of its W-potential 
and its B-potential. By symmetry, Black will prefer to play in the same cells. 
This is analogous to the Go proverb “Your opponent’s most important play is 
your most important play.” The total potentials for the position of Figure 4 are 
shown in Figure 6. 







Fig. 6. Total potentials 



4 Search 

Queenbee uses an iterative deepening α-β search enhanced with Minimal Window / 
Principal Variation Search and transposition tables. These techniques are 
used in most state-of-the-art game playing programs [Mar86]. The move ordering 
is based on cell potentials. Queenbee’s search incorporates the fractional ply 
searching ideas of the “Sex Search algorithm” [LBT89]. 

4.1 Sex Search 

The large branching factor of Hex makes regular full-width searching methods 
inadequate, even when enhanced with conventional search extensions and reduc- 
tions. On the other hand, a highly selective search is too unreliable due to its 
inability to cope with some tactical moves. The Sex Search algorithm, as described 
by Levy, Broughton, and Taylor [LBT89], is essentially a generalization 
of search extensions and reductions, whose behaviour can range smoothly over 
the spectrum between full-width and fully selective. Moreover, it is amenable to 
automated learning. The name “Sex Search” stands for “search extensions”. 

Sex Search proposes to assign a weight, or cost, to every move in the search 
tree. Rather than exploring lines until a certain fixed depth is reached, the Sex 
algorithm explores lines until their moves add up to a fixed cost. This cost 
limit may be called the budget. The idea is that “interesting” moves have low 
cost, while “uninteresting” moves have high cost. This way, branches with many 
uninteresting moves are not explored very deeply, which corresponds to search 
reductions. At the same time, branches that contain many interesting moves will 
be explored more deeply, corresponding to search extensions. 

If all moves are assigned a cost of 1, then a Sex Search with a budget of n is 
equivalent to a full-width fixed-depth search to n ply. If the moves have varying 
cost, but the average cost is 1, then the Sex Search is comparable to an n ply 
search with extensions and reductions. A move cost of k effectively extends the 
search by 1 - k ply if k < 1, and reduces the search by k - 1 ply if k > 1. 

If the range of costs of the available moves is large, then the Sex Search 
algorithm behaves much like a selective search. Consider, for example, a move 
m with cost 4. This cost ensures that the subtree below m will be explored to 
a depth of 3 less than the subtrees below m’s siblings. Due to the exponential 
nature of the search tree, the search effort required to explore m becomes in- 
significant in comparison with the effort required to explore m’s siblings. Thus 
the behaviour is similar to that of a selective search that would discard move 
m altogether. A selective search suffers from the unavoidable risk of discarding 
moves that turn out to be critical. Sex Search does not run this risk, as it does 
not actually discard any moves. 
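
The budget idea can be illustrated with a plain negamax search in which every move consumes part of a cost budget. This is a minimal sketch under stated assumptions, not the algorithm of [LBT89] or Queenbee's search: the function and parameter names are hypothetical, a line is cut off once the remaining budget is zero or negative, and α-β pruning and transposition tables are omitted for brevity.

```python
def sex_search(node, budget, children, cost, evaluate):
    """Negamax value of `node`; a line stops once its move costs use up the budget."""
    moves = children(node)
    if budget <= 0 or not moves:
        return evaluate(node)            # static value for the side to move
    best = -float("inf")
    for move, child in moves:
        remaining = budget - cost(node, move)
        best = max(best, -sex_search(child, remaining, children, cost, evaluate))
    return best

# Hypothetical toy tree; static values are from the viewpoint of the side to move.
tree = {
    "root": [("a", "A"), ("b", "B")],
    "A": [("a1", "A1"), ("a2", "A2")],
    "B": [("b1", "B1")],
}
static = {"root": 0, "A": 2, "B": -1, "A1": 3, "A2": -4, "B1": 5}

def children(node):
    return tree.get(node, [])

def evaluate(node):
    return static[node]

def uniform(node, move):
    return 1                             # every move costs one ply

def b_is_dull(node, move):
    return 2 if move == "b" else 1       # the "uninteresting" move costs more

depth_two = sex_search("root", 2, children, uniform, evaluate)
reduced = sex_search("root", 2, children, b_is_dull, evaluate)
```

With uniform costs the search behaves like a fixed-depth search to two ply. When move `b` costs 2, the subtree below it is evaluated one ply earlier, which is the search-reduction effect described above; the move itself is still examined rather than discarded.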

Sex Search is used in some high performance game playing programs, most 
notably in the well-known chess program Deep Blue [CHH99]. However, in most 
classes the fractional move costs are only assigned to certain special cases of 
moves, while the majority of the moves receives weight 1. Queenbee’s search is 
fully fractional, in that each move category has a fractional weight. 

4.2 Move Categories 

The crux of the Sex Search algorithm is finding a good cost function for moves. 
Queenbee uses the cell potential as a basis for the cost function. Moves are 
partitioned into equivalence classes, or move categories. Each move category 
has a weight associated with it. The cost of a particular move is obtained by 
retrieving the weight of its move category. A move m in a cell with potential p is 
a member of move category c(m) = p - p*, where p* is the lowest cell potential 
of all the other cells. If m has the lowest potential then its move category will be 
0 if there are other moves with the same potential. If m has the unique lowest 
potential, the move category will be negative and the static evaluation of m will 
be -c(m) better than that of the next best looking move. 
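
The category rule is easy to check on small numbers. The sketch below is illustrative only: `move_categories` is a hypothetical name and the potentials are made-up values, not taken from any position in the paper.

```python
def move_categories(potentials):
    """Map each cell to c(m) = p - p*, with p* the lowest potential among the other cells."""
    categories = {}
    for cell, p in potentials.items():
        others = [v for other, v in potentials.items() if other != cell]
        if others:                       # a lone cell has no category under this rule
            categories[cell] = p - min(others)
    return categories

unique_best = move_categories({"a": 7, "b": 9, "c": 9, "d": 12})
tied_best = move_categories({"a": 7, "b": 7, "c": 10})
```

In the first case cell `a` has the unique lowest potential and lands in category -2, meaning it looks 2 better than the next best move. In the second case the two best cells tie and both fall into category 0.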

5 Performance 

The best way to obtain information about a game playing program’s playing 
strength is by playing games against opponents of known strength. In Hex this 
can be done on the online game playing server Playsite.⁵ Queenbee played a 
small number of games on Playsite, in order to get a first impression of its playing 
strength. To obtain reliable information, however, many games are needed. Un- 
fortunately, it is not yet possible to play online games automatically, as Playsite 
is accessible only through a Java interface with an unknown communications 
protocol. 

The initial results show that Queenbee achieved a rating of 1876 in six games. 
Typical ratings for human players are in the range 1200-1400 for novices, 
1800 for advanced players, and 2100-2200 for the top players. Strong human 
players estimated Queenbee’s rating to be about 2100. The program scored two 
consecutive wins against a player rated 2119. However, these results may not 
mean much. The number of games played is small, and there likely is some 
overestimation on the part of the opponents due to their unfamiliarity with 
Queenbee’s style. 

5.1 Evaluation Function 

Since Queenbee has completely solved many opening positions in 6 x 6 Hex, it is 
possible to compare the evaluation function’s assessment of these positions with 
perfect knowledge. In a winning position we distinguish between good moves and 
bad moves. A bad move loses against perfect play, while a good move preserves 
the win. There are two types of good moves: optimal moves and suboptimal 
moves. A move is optimal if it maintains the shortest possible win. In a losing 
position there are no good or bad moves, as every move will lead to a loss against 
a perfect opponent. Yet there still is a distinction between optimal and subop- 
timal moves. Optimal moves are those that delay the loss as long as possible, 
while suboptimal moves do not. 

Table 1 lists the number of times each move category occurred, in the “count” 
rows, as well as the frequency in percentages of each move type. The numbers 
were obtained from 27 winning positions and 34 losing positions that Queenbee 
has analyzed. The table indicates that the lower move categories contain a 
significantly higher percentage of good moves than average. Note that negative 
move categories, intuitively corresponding to apparently “forced” moves, appear 
to be better in losing positions than they are in winning positions. This may be 
because the winning side typically has several options to choose from, while the 
losing side is more often forced to reply to a threat. 

⁵ http://www.playsite.com/games/board/hex 

The effectiveness of the Sex Search relies on the move categories’ capacity to 
distinguish good moves from bad moves. Any category that has a significantly 
different distribution of good and bad moves compared to the overall distribution 
is therefore valuable. Categories can be assigned low or high weights according 
to whether good moves are relatively common or uncommon, respectively. The 
bottom row of Table 1 indicates that almost all categories do show a large 
deviation from the average frequency. Moreover, the frequency of optimal moves 
decreases almost monotonically as the move category increases, indicating that 
the cell potential is indeed a good estimator of move strength. 



position  move        move category
type      type        ≤ -3   -2   -1    0    1    2    3    4    5    6    7  ≥ 8   all
winning   optimal        -   25   13   25   11    2    7    2    4    -    -    -     5
          suboptimal     -    -    -   22    6   20    6    6    8    3    -    -     7
          good           -   25   13   47   17   22   14    8   12    3    -    -    11
          bad            -   75   88   53   83   78   86   92   88   97  100  100    89
          count          -    4    8   60   71  134  140  171  131  113   39   19   890
losing    optimal      100  100  100   35    6    1    1    0    -    -    -    -     4
          suboptimal     -    -    -   65   94   99   99  100  100  100  100  100    96
          count          3    3    5   63   62  139  203  213  178  148   56   76  1149
all       optimal      100   57   46   30    9    2    4    1    1    -    -    -     4

Table 1. Move types in each move category 



In order for the search to produce reliable results, it is not necessary that all 
good moves in a position be found. What is important is that at least one optimal 
move is found. Table 2 lists the lowest move categories in which optimal and 
suboptimal moves were encountered in the same set of positions. It is clear that 
in most cases there is an optimal or suboptimal move to be found in categories at 
most 0 or 1. Positions in which good moves only occur in higher move categories 
are rare. 



position  move      move category
type      type       -5   -2   -1    0    1    2    3    5
winning   optimal     -    1    1    9    6    2    7    1
          good        -    1    1   12    7    2    3    1
losing    optimal     3    3    5   21    1    -    -    1

Table 2. Category containing the lowest potential move 

From these tables it becomes apparent that the evaluation function enables 
a good partitioning of moves into classes of different move quality. It is also clear 
that the cell potentials provide a good estimate of the quality of the moves, since 
good moves are more common in low move categories. 



5.2 Search 

Queenbee was tested on 25 analyzed winning positions, to find out how many of 
these positions can be played correctly under actual tournament play conditions. 
It is not sufficient to merely find a winning move. The program is only consid- 
ered to have “solved” a position once it settles on a winning move and never 
changes its mind to a losing move anymore. On a PII/400 machine, Queenbee 
searches about 12,000 nodes per second. This means that the program is able 
to search about two million nodes within the time constraint. A position was 
played correctly if Queenbee settled on a winning move within two million nodes 
of search. 

With all category weights set to 1, corresponding to a full-width fixed-depth 
search, Queenbee was able to solve 12 out of the 25 positions. With hand tuned 
weights, the performance increased to 14 out of 25.⁶ The median solution length 
for these positions was 22 ply. 
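
The hand tuned weights reported in the footnote to this section amount to a small lookup from move category to move cost. The sketch below encodes those reported values; the names `HAND_TUNED` and `move_cost` are hypothetical, and categories are assumed to be integers.

```python
HAND_TUNED = [1, 1, 1.5, 3, 5]           # weights for categories 0 through 4

def move_cost(category):
    """Cost of a move in the given category, using the hand tuned weights."""
    if category < 0:
        return 0.5                       # negative categories: apparently forced moves
    if category >= 5:
        return 6                         # categories 5 and higher
    return HAND_TUNED[category]

costs = [move_cost(c) for c in (-2, 0, 2, 4, 7)]
```

Under these weights a line of apparently forced moves (cost 0.5 each) is searched twice as deep as a line of category 0 moves for the same budget, while a category 5 move uses up as much budget as six ordinary plies.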

The depth of a win is not necessarily what makes a position hard to solve. 
If the winning move is obviously better than the other moves, particularly if 
all other moves lose quickly, the solution is not difficult to see. However, in the 
positions used in the test, the longest losing lines had a median length of 17 and 
a maximum length of 23. The correct decision between such deep wins and deep 
losses is extremely difficult to find. It should also be noted that these are all 
winning positions at the beginning of the game, and as such they are the most 
difficult of all 6 x 6 Hex problems. 



6 Work in Progress 

The weights for the move categories are hand picked, based on intuitions about 
the move categories and their relation to the game. These weights form an attrac- 
tive target for machine learning techniques. Experiments are currently running 
to apply the “Learning Search Control” algorithm of Yngvi Björnsson [Bjö00] 
to dynamically establish these weights based on actual game play. Early results 
seem to indicate that counterintuitive looking weights outperform more plausible 
looking weights, but the evidence is still very inconclusive. 

The search algorithm will also be enhanced with null move searches [Don93]. 
The null move technique can drastically reduce the size of search trees, and 
has proven to be very successful in chess. It suffers from the great danger of 
misjudging “zugzwang” positions, which are positions where the player to move 
would prefer to skip a move. Fortunately, zugzwang does not occur in Hex. Any 
move is always better than no move at all. Null move therefore promises to be 
a powerful enhancement of the search in Queenbee. 

⁶ The weights were set to [1, 1, 1.5, 3, 5] for categories 0 through 4, 0.5 for 
negative categories, and 6 for categories 5 and higher. 

Efforts are underway to construct a Java interface that enables Queenbee 
to play games online on its own web page. This will not only provide valuable 
feedback on the program’s playing strength, but it will also facilitate the learning 
of move category weights. 



7 Conclusions 

The comparison with perfect information on a 6 x 6 board is a favourable indi- 
cation of the quality of the evaluation function. It is of course not to be taken 
for granted that these results will persist on larger board sizes. On the other 
hand, it should be noted that perfect play on a 6 x 6 board is far from trivial. 
One piece of anecdotal evidence is that Queenbee has actually disproved some 
of the common beliefs of strong Hex players about certain opening moves. 

Equipped with the perfect opening book and a powerful search to fill in the 
blanks where the opening book ends, Queenbee plays stronger than any human 
player on board sizes up to 6 x 6. On larger boards, human players’ supremacy 
increases as the board size increases. There is some room for improvement by 
refining the search methods, such as by incorporating null moves, but it is not 
clear whether this will be enough to beat the top human players on standard 
board sizes like 10 x 10. 

What, then, is needed to beat the top players? Human players have the 
important ability to recognize patterns in Hex. The two-bridge is the simplest 
example, but there are many common patterns known to human players where 
a virtual connection of a piece to the edge is guaranteed. It can take many plies 
of search to uncover these connections implicitly. It seems obvious to humans 
that knowledge about these patterns is therefore a key ingredient to any strong 
Hex player. Queenbee is equipped to recognize many common patterns, but it 
is not yet clear how to integrate this knowledge into the evaluation function. 
Worse still, the patterns cause significant horizon-type problems⁷ that actually 
lead to a net decrease in playing strength. For this reason, Queenbee currently 
does not use the patterns at all. Finding out how to correctly use the patterns 
in the evaluation function and how to avoid destructive interference with the 
search is therefore an important challenge. 

If these problems can be solved adequately, Hex programs may be able to 
reach the first milestone, beating all human players on a 10 x 10 board, in the near 
future. Top human players prefer to play on 14 x 14 boards, and in some cases 
even 18 x 18 boards, where the game is more challenging. As it is progressively 
more difficult for computers to beat humans on larger board sizes, Hex will 

⁷ The horizon effect occurs when a program plays suboptimal moves in order to delay 
an impending threat; if the threat is pushed beyond the search horizon, the program 
will assume it has “solved” the problem, whereas in reality it has often made it worse. 

remain a challenge for game playing programs even after this first milestone is 
reached. 

Abstract. For many constraint satisfaction problems, finding complete 
solutions is impossible (i.e. problems may be over-constrained). In such 
cases, we want a partial solution that satisfies as many constraints as 
possible. Several backtracking and local search algorithms exist that are 
based on the assignment of values to variables in a fixed order, until 
a complete solution or a reasonably good partial solution is obtained. 
In this study, we examine the dual graph approach for solving CSPs. 
The idea of dual graphs can be naturally extended to another structure- 
driven approach to CSPs, constraint directed backtracking that inher- 
ently handles k-ary constraints. In this paper, we present a constraint 
directed branch and bound (CDBB) algorithm to address the problem 
of over-constrained-ness. The algorithm constructs solutions of higher 
arity by joining solutions of lower arity. When computational resources 
are bounded, the algorithm can return partial solutions in an anytime 
fashion. Some interesting characteristics of the proposed algorithm are 
discussed. The algorithm is implemented and tested on a set of ran- 
domly generated problems. Our experimental results demonstrate that 
the CDBB consistently finds better solutions more quickly than back- 
tracking with branch and bound. Our algorithm can be extended with in- 
telligent backtracking schemes and local consistency maintenance mech- 
anisms just like backtracking has been in the past. 



1 Introduction 

Solving a constraint satisfaction problem (CSP) involves finding a consistent as- 
signment of values to variables subject to restrictions on the combinations of 
values. If the problem is over-constrained it has no solution. Partial constraint 
satisfaction problems (PCSP) were proposed by Freuder and Wallace |Z], to rep- 
resent and solve relaxed versions of over-constrained problems. Here one is willing 
to violate some of the constraints. Maximal constraint satisfaction attempts to 
find a solution that satisfies as many constraints as possible. 
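
Maximal constraint satisfaction can be stated very compactly in code. The exhaustive search below is only a baseline for illustrating the objective; it is not the CDBB algorithm of this paper nor Freuder and Wallace's branch and bound, and the function name and the representation of constraints as `(scope, predicate)` pairs are assumptions made for the example.

```python
from itertools import product

def max_csp(domains, constraints):
    """Exhaustively find an assignment satisfying as many constraints as possible.

    domains maps variable names to lists of values; each constraint is a pair
    (scope, predicate) where scope is a tuple of variable names.
    """
    names = list(domains)
    best, best_count = None, -1
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        count = sum(1 for scope, pred in constraints
                    if pred(*(assignment[v] for v in scope)))
        if count > best_count:
            best, best_count = assignment, count
    return best, best_count

def differ(u, v):
    return u != v

# Over-constrained example: three mutually different variables, two values.
domains = {"x": [0, 1], "y": [0, 1], "z": [0, 1]}
constraints = [(("x", "y"), differ), (("y", "z"), differ), (("x", "z"), differ)]
solution, satisfied = max_csp(domains, constraints)
```

No assignment satisfies all three constraints, so the best partial solution satisfies two of them. The cost of this baseline grows exponentially with the number of variables, which is why bounded and anytime methods are of interest.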

Over-constrained problems commonly arise in many domains m- In addi- 
tion to relaxing the requirement that all constraints need to be satisfied, PCSP 



H. Hamilton and Q. Yang (Eds.): Canadian AI2000, LNAI 1822, pp. 26 4891 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000 



A Constraint Directed Model for Partial Constraint Satisfaction Problems 






algorithms can also be used to provide partial solutions when there is a fixed 
bound on computational resources. Here they report the best solutions found 
so far when the resource bound has been reached. Such algorithms are called 
anytime algorithms. Freuder and Wallace introduced complete methods with
branch and bound as a natural extension to classical backtracking for partial
satisfaction. Several local search algorithms have also been developed to address
over-constrained problems (e.g., [4][15][16]). Often constraints are classified as
being either required (hard constraints) or preferential (soft constraints). Hard 
constraints must be satisfied while soft constraints may be violated. 

Graph theoretic properties of constraint networks have been used to develop
algorithms for constraint solving. Among these, algorithms that are based on
one of two representations, the primal constraint graph and the dual constraint
graph, are more common. Structure driven algorithms that are constructed from
the primal constraint graph (a generalisation of a simple binary constraint
graph) have led to chronological backtracking based solvers that incrementally
instantiate variables with values and backtrack when dead-ends are
encountered.

Recently a lot of research has focussed on identifying tractable classes of
CSPs through decomposition schemes and the study of structural properties
of the CSPs [8][12][13]. The CSP has been shown to be equivalent to various
database problems. A new form of local consistency known as w-consistency has
been defined and shown to be applicable to both binary and non-binary
constraints. In addition, for some CSPs enforcing w-consistency has been shown
to ensure tractability while enforcing other forms of consistency cannot. In [8]
a new class of tractable CSPs is introduced based on hypertree decomposition
from database theory. This decomposition strategy has been shown to dominate
many other forms of consistency for the case of non-binary CSPs.

A constraint directed backtracking algorithm (CDBT) has been presented
to solve CSPs by backtracking in the dual graph. It has been shown
that CDBT has a search space more limited than the corresponding BT search
space. This is intuitive from the observation that the size of a constraint is
usually much smaller than the product of the involved variable domains. One
of the main motivations for this study is to investigate whether the CDBT
algorithm can be extended with branch and bound to handle over-constrained
problems in the same way BT was extended into BTBB. In this
paper we

1. Present a branch and bound extension to CDBT to handle over constrained 
problems. 

2. Provide an empirical study and analysis of the CDBB algorithm on
randomly generated CSPs.

Section 2 provides some definitions and background while Section 3 describes
the CDBT algorithm. In Section 4 we illustrate the execution of the algorithm on
a simple example. Section 5 describes the partial constraint satisfaction model
and the CDBB algorithm, which is a structure driven algorithm based on the
dual constraint graph. Our experimental results (Section 7) indicate that CDBB
consistently finds better solutions much more quickly than BTBB (backtracking
with branch and bound).

2 Preliminary Definitions 

Let $(V, D, C)$ be a CSP where $V$ is the set of variables, $D$ are their domains,
and $C$ is the set of constraints. Furthermore, we can assume that each constraint
$C_i = (V_i, S_i) \in C$ consists of a list of variables $V_i = (v_{i1}, \ldots, v_{ik}) \subseteq V$ and a
predicate on these variables, $S_i \subseteq D_{v_{i1}} \times \cdots \times D_{v_{ik}}$. A binary CSP is one in
which all the constraints are defined over pairs of variables. Associated with 
every binary CSP is a constraint graph with a node for every variable and an 
edge between two nodes if their variables share a constraint. 

Dechter and Pearl propose the transformation of any CSP $P_1$ into its
dual form, i.e., a new CSP $P_2$ where constraints are now variables with
structured domains and variables are now the constraints. Although binary and
non-binary constraint representations have been shown to be equivalent, it
is not necessarily the case that one must abandon the study of either of these.
Many problems are expressed naturally in one of these forms and trying to
represent them in the other will be unnatural [1]. Using the standard definition of
a labeled graph as a triple $(N, A, l)$, where $N$ is the set of nodes, $A$ is the set
of arcs and $l$ represents the label on each arc, we can define two formalisms to
represent CSPs.

Definition 1. Given a binary CSP, the primal constraint graph associated
with it is a labeled constraint graph, where $N = V$, $(v_i, v_j) \in A$ iff $\exists C_{ij} \in C \mid
V_{ij} = \{v_i, v_j\}$. Also the label on arc $(v_i, v_j)$ is $C_{ij}$. Given an arbitrary CSP,
the dual constraint graph associated with it is a labeled graph, where $N = C$,
$(C_i, C_j) \in A$ iff $V_i \cap V_j \neq \emptyset$. Also the label on arc $(C_i, C_j)$ is $V_i \cap V_j$.

Intuitively, in the primal graph $|N| = |V|$ and $|A| = |C|$, and an arc $a \in A$ that
connects two variables connected by constraint $c$ is labeled by the definition of $c$.
This representation is good for binary CSPs but is not useful for general CSPs.
The primal graph for higher order CSPs is a hyper-graph. The dual graph
constraint network can be solved by techniques that are applicable to binary
networks by considering the constraints as the variables and the tuples that
instantiate them as the domains.
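The dual constraint graph of Definition 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `dual_constraint_graph` and the three constraint scopes are hypothetical.

```python
from itertools import combinations


def dual_constraint_graph(scopes):
    """Build the arcs of the dual graph.

    scopes maps a constraint name to the tuple of variables it constrains.
    An arc joins two constraints whose scopes intersect, and it is labeled
    with the shared variables.
    """
    arcs = {}
    for ci, cj in combinations(sorted(scopes), 2):
        shared = set(scopes[ci]) & set(scopes[cj])
        if shared:
            arcs[(ci, cj)] = shared
    return arcs


# Hypothetical scopes: three binary constraints over three variables.
scopes = {
    "C1": ("shirt", "slacks"),
    "C2": ("shoes", "slacks"),
    "C3": ("shoes", "shirt"),
}
dual = dual_constraint_graph(scopes)
for arc, label in sorted(dual.items()):
    print(arc, sorted(label))
```

Each node of the dual graph is a constraint, so a solver working on this graph picks tuples of a constraint rather than values of a single variable.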

Definition 2. If $V_i$ and $V_j$ are sets of variables, let $S_i$ be an instantiation of the
variables in $V_i$. $S_i[V_j]$ is the tuple consisting of only the components of $S_i$ that
correspond to the variables in $V_j$. This is also called the projection of tuple $S_i$
on the variables in $V_j$. Let $C_i, C_j$ be two constraints in $C$. The join of $C_i, C_j$,
denoted by $C_i \bowtie C_j = C_{ij}$, is the set $\{t \mid t \in S_{ij} \wedge (t[V_i] \in S_i) \wedge (t[V_j] \in S_j)\}$.
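Projection and join, as defined above, can be sketched as follows. This is an illustrative sketch under stated assumptions: tuples are represented as Python dictionaries, the helper names `project` and `join` are hypothetical, and the sample relations are made up for the example.

```python
def project(assignment, variables):
    """Projection t[V_j]: keep only the components for the given variables."""
    return {v: assignment[v] for v in variables if v in assignment}


def join(scope_i, tuples_i, scope_j, tuples_j):
    """Natural join of two extensional constraints.

    Returns the joined scope and the list of joined tuples (as dicts).
    Two tuples are combined only if they agree on every shared variable.
    """
    scope = tuple(scope_i) + tuple(v for v in scope_j if v not in scope_i)
    shared = [v for v in scope_j if v in scope_i]
    result = []
    for ti in tuples_i:
        a = dict(zip(scope_i, ti))
        for tj in tuples_j:
            b = dict(zip(scope_j, tj))
            if all(a[v] == b[v] for v in shared):
                result.append({**a, **b})
    return scope, result


# Hypothetical relations for two constraints sharing the variable "slacks".
c1 = [("Green", "Dress Gray"), ("White", "Dress Blue"), ("White", "Denims")]
c2 = [("Cordovans", "Dress Gray"), ("Sneakers", "Denims")]
scope, joined = join(("shirt", "slacks"), c1, ("shoes", "slacks"), c2)
print(scope, joined)
```

Only two of the six possible pairings survive the join, which illustrates why joining constraints explores fewer combinations than enumerating the product of the variable domains.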

The set of all solutions of a constraint satisfaction problem is equal to the
join of the relational instances corresponding to the constraints. This is the
basis for the constraint directed backtracking (CDBT) algorithm. In this paper
we extend the dual graph based constraint directed backtracking for partial
constraint satisfaction in the same way as the branch and bound extension to
backtracking (based on the primal graph) provided by Freuder and Wallace [7].

3 Constraint Directed Backtracking 

An assignment of values to a subset of variables is called a partial assignment or 
a tuple. A tuple, ti, is consistent if it satisfies all the constraints whose variables 
are completely instantiated by ti. This set of constraints whose variables are 
completely instantiated is called the constraint elimination set of the partial 
assignment. More formally, given a CSP $P = (V, D, C)$ and a tuple $t_i$ on a set
of variables $V_{t_i} \subseteq V$, the constraint elimination set of $t_i$, denoted by $CES_{t_i}$, is
defined as the set $\{C_h \mid C_h \in C, V_h \subseteq V_{t_i}\}$. A partial instantiation is consistent
if it satisfies all the constraints in its constraint elimination set. Such a partial 
instantiation is called a partial solution. A complete solution is a consistent 
instantiation of all the variables. 
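The constraint elimination set and the consistency test built on it can be sketched as below. The function names and the sample data are assumptions made for illustration only.

```python
def constraint_elimination_set(assignment, scopes):
    """Names of the constraints whose variables are all instantiated."""
    assigned = set(assignment)
    return {name for name, scope in scopes.items() if set(scope) <= assigned}


def is_consistent(assignment, scopes, relations):
    """True if the assignment satisfies every constraint in its elimination set."""
    for name in constraint_elimination_set(assignment, scopes):
        values = tuple(assignment[v] for v in scopes[name])
        if values not in relations[name]:
            return False
    return True


# Hypothetical CSP with three binary constraints given in extensional form.
scopes = {
    "C1": ("shirt", "slacks"),
    "C2": ("shoes", "slacks"),
    "C3": ("shoes", "shirt"),
}
relations = {
    "C1": {("Green", "Dress Gray"), ("White", "Dress Blue"), ("White", "Denims")},
    "C2": {("Cordovans", "Dress Gray"), ("Sneakers", "Denims")},
    "C3": {("Cordovans", "White")},
}
partial = {"shirt": "White", "slacks": "Denims"}
full = {"shirt": "White", "slacks": "Denims", "shoes": "Sneakers"}
print(constraint_elimination_set(partial, scopes), is_consistent(partial, scopes, relations))
print(is_consistent(full, scopes, relations))
```

The partial assignment is a partial solution because only C1 is fully instantiated and it is satisfied; the full assignment fails because C3 rejects the pair (Sneakers, White).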

Definition 3. Consider a tuple $t_i$ as a consistent instantiation of variables in
$V_{t_i}$. An extension of $t_i$ to variables in $V_{t_i} \cup V_{t_j}$ is a tuple $t_{ij}$ where $t_{ij}$ is an
instantiation to variables in $V_{t_i} \cup V_{t_j}$. The two tuples $t_i$ and $t_j$ are compatible
if $t_i[V_{t_i} \cap V_{t_j}] = t_j[V_{t_i} \cap V_{t_j}]$, i.e., the two tuples agree on values for all common
variables. The tuple $t_i \bowtie t_j$ is a consistent extension of $t_i$ if $t_i$ and $t_j$ are
compatible and $t_i \bowtie t_j$ satisfies all the constraints in $CES_{t_{ij}}$.

Definition 4. Let $C_{cover} = \{C_1, C_2, \ldots, C_m\}$. Each $C_i \in C_{cover}$ is given as
$(V_i, S_i)$, where $V_i \subseteq V$. $C_{cover}$ covers $V$ iff $\bigcup_{i=1}^{m} V_i = V$; $C_{cover}$ is then a
constraint cover of $V$. As well, $C_{cover}$ is a minimal constraint cover of $V$ if it is a
constraint cover of $V$ and no proper subset of $C_{cover}$ is a constraint cover of $V$.
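One way to obtain a minimal constraint cover is a greedy selection followed by a pruning pass. The paper does not prescribe a particular selection procedure, so the sketch below is an assumption; the name `minimal_constraint_cover` and the sample scopes are hypothetical.

```python
def minimal_constraint_cover(variables, scopes):
    """Greedy cover followed by removal of redundant constraints.

    The greedy phase always picks a constraint that covers at least one
    uncovered variable. The pruning phase drops any constraint whose
    variables are all covered by the rest, so no proper subset of the
    result is still a cover.
    """
    uncovered = set(variables)
    cover = []
    while uncovered:
        best = max(sorted(scopes), key=lambda c: len(uncovered & set(scopes[c])))
        gained = uncovered & set(scopes[best])
        if not gained:
            raise ValueError("the constraints do not cover every variable")
        cover.append(best)
        uncovered -= gained
    for c in list(cover):
        rest = [d for d in cover if d != c]
        covered = set().union(*(scopes[d] for d in rest))
        if covered >= set(variables):
            cover = rest
    return cover


# Hypothetical scopes: three binary constraints over three variables.
scopes = {"C1": ("shirt", "slacks"), "C2": ("shoes", "slacks"), "C3": ("shoes", "shirt")}
cover = minimal_constraint_cover(["shirt", "slacks", "shoes"], scopes)

# A case where the greedy phase alone would keep a redundant constraint.
scopes2 = {"S1": (1, 2, 3, 4), "S2": (1, 2, 5), "S3": (3, 4, 6)}
cover2 = minimal_constraint_cover(range(1, 7), scopes2)
print(cover, cover2)
```

The pruning pass matters in the second case: the greedy phase picks S1 first, but S2 and S3 together already cover every variable, so S1 is dropped.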

The algorithm has a simple recursive structure similar to BT. Figure 1
describes a revised version of constraint directed backtracking. CDBT finds a
constraint cover from the given set of constraints that covers all the variables
of the CSP. CDBT with a conservative constraint selection strategy can be shown
to always have a smaller search space than BT, but a more aggressive heuristic
can often perform better on average, although no worst case performance
guarantees can be provided. Without loss of generality, we assume that every CSP
is given in a form such that it has a constraint cover which is a subset of the
problem's constraints.¹

¹ In any case, one can always add dummy unary constraints on single variables (which
allow all possible values in the domain of the variable).

Input: CSP P = (V, D, C)
Output: CurrentSol with |V_CurrentSol| = |V|

procedure CDBT(CurrentSol, Cons, Tuples)
begin
    if Cons = ∅ then                        // All constraints covered
        return "finished"
    else if Tuples = ∅ then                 // No values in current constraint left
        return "continue"
    else
        CurrTup := First Value in Tuples
        NewSol := CurrentSol ⋈ CurrTup
        CES_NewSol := Constraints completely instantiated by NewSol
        Foreach New Constraint C_i in CES_NewSol
            if NewSol is consistent with C_i and
               CDBT(NewSol, Cons - CES_NewSol, Tuples_{Cons_next})
               = "finished" then return "finished";
            else                            // Otherwise back up, to try a different value
                return CDBT(CurrentSol, Cons, Tuples - CurrTup)
end

Fig. 1. Revised CDBT algorithm

Initially Cons represents the set of all constraints, while Tuples represents
the set of values (tuples) in the domain of the currently selected constraint. At
each step the selected constraint is joined with the previously constructed partial
solution. Each join covers some new variables which were previously uncovered.
The algorithm terminates when all the variables have been covered. At each
step of the algorithm, CDBT removes from Cons the constraints that have been
verified against. Cons therefore represents the set of constraints that have not
yet been satisfied by the current instantiation. When Cons = ∅, a solution that
satisfies all the constraints has been found. Backtracking occurs when
Tuples = ∅, i.e., there are no values in the domain of the current constraint to
select from, or when the current instantiation is inconsistent.
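The recursive structure described above can be sketched in Python. This is a simplified reading of CDBT, not the authors' code: it iterates over the tuples of each cover constraint instead of passing Cons and Tuples explicitly, and the function name `cdbt` and the sample relations are assumptions made for illustration.

```python
def cdbt(cover, scopes, relations):
    """Backtrack over the constraints in `cover`, joining one tuple per constraint.

    Returns a complete consistent assignment, or None if there is none.
    `cover` must be a constraint cover of all the variables.
    """

    def extend(partial, depth):
        if depth == len(cover):
            return partial
        name = cover[depth]
        for values in relations[name]:
            candidate = dict(zip(scopes[name], values))
            # Compatible tuples agree on every shared variable.
            if any(partial.get(v, x) != x for v, x in candidate.items()):
                continue
            new = {**partial, **candidate}
            # Check only the constraints that this join completes.
            newly = [c for c in scopes
                     if set(scopes[c]) <= set(new)
                     and not set(scopes[c]) <= set(partial)]
            if all(tuple(new[v] for v in scopes[c]) in relations[c] for c in newly):
                found = extend(new, depth + 1)
                if found is not None:
                    return found
        return None  # no tuple of this constraint works: back up

    return extend({}, 0)


# Hypothetical over-constrained CSP and a relaxed variant that has a solution.
scopes = {"C1": ("shirt", "slacks"), "C2": ("shoes", "slacks"), "C3": ("shoes", "shirt")}
relations = {
    "C1": [("Green", "Dress Gray"), ("White", "Dress Blue"), ("White", "Denims")],
    "C2": [("Cordovans", "Dress Gray"), ("Sneakers", "Denims")],
    "C3": [("Cordovans", "White")],
}
relaxed = dict(relations, C3=[("Cordovans", "White"), ("Sneakers", "White")])
cover = ["C1", "C2"]
print(cdbt(cover, scopes, relations))
print(cdbt(cover, scopes, relaxed))
```

Only C3 ever needs an explicit check here, because the tuples joined at each level are drawn from C1 and C2 themselves.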

As with BT, CDBT works through a search tree, although the notions as- 
sociated with the search tree in CDBT are different than those associated with 
the BT search tree. Every level of the search tree, corresponds to an extension 
to a partial assignment, constructed at a higher level in the tree. Every node of 
the tree corresponds to a consistent extension to a tuple at the previous level, 
thereby generating a new tuple. Leaf nodes correspond to tuples that cannot be 
extended any further or tuples that are solutions. The length of a path from root 
to a solution leaf node, is the number of constraints in the selected covering set. 
If the selected covering set is a minimal constraint cover then the maximum path 
length is less than or equal to $|V|$. (Equality occurs when dealing with unary
constraints.) Any leaf node that cannot be extended causes CDBT to backtrack,
while solution leaf nodes cause CDBT to terminate.

4 Example

In this section we compare the execution of the CDBT algorithm with that
of the BT algorithm on a toy example, the robot clothing problem described
by Freuder and Wallace [7]. The CSP is described in Figure 2. The problem is
described as a constraint graph, where nodes represent variables and edges
represent constraints between pairs of variables. There are 3 variables and
3 constraints. The domain of each variable is labeled on its node. The
satisfying tuples of each constraint are represented by the sets labeled on
the edges. Also given is the dual constraint graph representation of the CSP,
where the nodes are the constraints, an edge between two constraints indicates
that the two constraints share variables, and the label on the edge gives the
variables common to the two constraints.

Fig. 2. A robot clothing example (Primal and Dual Constraint Graphs)

The execution of BT on this problem is shown in Figure 3. The internal nodes
of the BT search tree are consistent nodes, while the leaf nodes are either
solutions or dead-ends. In this figure, the X represents a dead-end, and the
algorithm is forced to backtrack here. Consistent nodes are depicted with a
circle around them. As seen, there is no complete solution to this CSP, and BT
explores the complete search tree to report the absence of a solution. BT
visits 12 nodes in all and performs 11 constraint checks [7]. As seen, each
level of the search tree represents a problem variable, and each node represents
an assignment of a new value to a variable.

Fig. 3. BT and CDBT on Robot Clothing

When CDBT is executed to solve this CSP, the search tree that is formed is 
as shown in Figure 3. As mentioned earlier, this search tree is slightly different
from the previously described BT search tree. A node in the CDBT search tree 
corresponds to a tuple (an instantiation to a set of variables). Internal nodes are 
consistent tuples, while leaf nodes correspond either to tuples that cannot be 
extended further thereby causing CDBT to backtrack (represented here by X), 
or tuples that are solutions. Consistent nodes are again represented by circles. 
Each level of the tree represents the choice of a constraint to extend a tuple. 
Each node represents the specific extension of a tuple by joining it with another 
tuple from the selected constraint (using the $\bowtie$ operator).

In this execution, CDBT selects $C_1$ and $C_2$ as the constraints that form the
cover, and the execution visits 4 nodes and performs 2 constraint checks before
reporting the absence of any solutions to the CSP. It can be seen that if $t_1$ is
selected from $C_1$ and $t_2$ is selected from $C_2$, then $t_{12} = t_1 \bowtie t_2$. Also
$CES_{t_{12}} = \{C_1, C_2, C_3\}$.² The consistency of the extension is verified against
$C_3$ before extending it any further. The performance of CDBT can be improved by
careful selection of the constraints to form the cover. It can be shown that if
the selection were $C_1, C_3$ instead, CDBT would visit 3 nodes in all, and if
$C_2, C_3$ were selected, 5 nodes would be visited.



5 Partial Constraint Satisfaction 

In this section, we present a branch and bound variation (see Figure 4) of CDBT,
constraint directed branch and bound (CDBB), which can be used for maximal
constraint satisfaction: it finds a partial solution to a CSP that violates fewer
than a given number of constraints. This algorithm extends CDBT in a similar
fashion to Freuder and Wallace's [7] extension of BT.

As with the Freuder and Wallace algorithm, CDBB allows for the
specification of two kinds of bounds on solution quality. $B_{necc}$ is a necessary
bound, and can be set if it is known in advance that there is a solution that
violates fewer than $B_{necc}$ constraints. It can also be used to specify a
requirement that only solutions that violate fewer than $B_{necc}$ constraints are
to be reported by the algorithm. If no information about this is initially
available, one can set $B_{necc}$ to infinity.³ The second bound, $B_{suff}$,
corresponds to a sufficient bound. It can be used to specify that a solution
that violates no more than $B_{suff}$ constraints is satisfactory.

² $C_1$ and $C_2$ need not be verified against, since the tuples selected for the
join were selected from these constraints.

³ Infinity, of course, is the same as saying any number of constraint violations, and
hence one could set $B_{necc}$ to be equal to the number of constraints in the CSP.

Input: CSP P = (V, D, C)
Output: CurrentSol with |V_CurrentSol| = |V|

procedure CDBB(CurrentSol, Distance, Cons, Tuples)
begin
    if Cons = ∅ then                            // All constraints covered
        BestSolution := CurrentSol
        Bnecc := Distance
        if Bnecc ≤ Bsuff then return "finished"  // Acceptable solution
        else return "continue"
    else if Tuples = ∅ then                     // No values in current constraint left
        return "continue"
    else if Distance ≥ Bnecc then
        return "continue"
    else
        NewDistance := Distance
        CurrTup := First Value in Tuples
        NewSol := CurrentSol ⋈ CurrTup          // directional join (Definition 6)
        CES_NewSol := Constraints completely instantiated by NewSol
        Foreach New Constraint C_i in CES_NewSol
            if NewSol is inconsistent with C_i
                NewDistance := NewDistance + 1
        if NewDistance < Bnecc and
           CDBB(NewSol, NewDistance, Cons - CES_NewSol, Tuples_{Cons_next})
           = "finished" then return "finished";
        else                                    // Otherwise back up, to try a different value
            return CDBB(CurrentSol, Distance, Cons, Tuples - CurrTup)
end

Fig. 4. CDBB algorithm



Definition 5. A tuple $t_i$ is defined to be par-consistent if the number of
inconsistencies of the tuple $t_i$ with respect to the constraint elimination set of
$V_{t_i}$ is less than a pre-determined bound ($B_{necc}$).

Definition 6. Consider two tuples $t_i$ and $t_j$. A directional join of $t_i$ and $t_j$,
$t_{ij} = t_i \overrightarrow{\bowtie} t_j$, is defined as follows. If $t_i[V_{t_i} \cap V_{t_j}] = t_j[V_{t_i} \cap V_{t_j}]$, then
$t_i \overrightarrow{\bowtie} t_j = t_i \bowtie t_j$. Otherwise $V_{t_{ij}} = V_{t_i} \cup V_{t_j}$, $t_{ij}[V_{t_i}] = t_i[V_{t_i}]$,
$t_{ij}[V_{t_i} \cap V_{t_j}] = t_i[V_{t_i} \cap V_{t_j}]$, and
$t_{ij}[V_{t_j} - (V_{t_i} \cap V_{t_j})] = t_j[V_{t_j} - (V_{t_i} \cap V_{t_j})]$.

Intuitively the $\overrightarrow{\bowtie}$ operator performs a normal join of two tuples $t_i$ and $t_j$
when $t_i$ and $t_j$ are compatible and a directional join of tuples $t_i$ and $t_j$ when
they are not compatible. It is clear that $(t_i \overrightarrow{\bowtie} t_j) \neq (t_j \overrightarrow{\bowtie} t_i)$ unless
$t_i[V_{t_i} \cap V_{t_j}] = t_j[V_{t_i} \cap V_{t_j}]$. The directional join of $t_i$ and $t_j$ maintains in
$t_{ij}$ all assignments to all variables in $t_i$, and for the remaining variables in
$t_{ij}$ it assigns the values for those variables from $t_j$. The directional join
guarantees that assignments made to variables earlier are not erased by joins
performed later. Since $t_i \overrightarrow{\bowtie} t_j$ is not necessarily equal to $t_j \overrightarrow{\bowtie} t_i$, the
order of selection of constraints can affect the performance of the algorithm.
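The directional join can be sketched with dictionaries: the earlier tuple wins on shared variables and the later tuple only supplies values for new variables. The function names and sample tuples below are assumptions made for illustration.

```python
def compatible(t_i, t_j):
    """True if the two tuples agree on every variable they share."""
    return all(t_j[v] == x for v, x in t_i.items() if v in t_j)


def directional_join(t_i, t_j):
    """Directional join of t_i and t_j.

    Every assignment in t_i is kept; t_j contributes values only for
    variables that t_i does not assign. For compatible tuples this is
    the ordinary join.
    """
    return {**t_j, **t_i}


# Hypothetical tuples that disagree on the shared variable "slacks".
t1 = {"shirt": "Green", "slacks": "Dress Gray"}
t2 = {"shoes": "Sneakers", "slacks": "Denims"}
left = directional_join(t1, t2)
right = directional_join(t2, t1)
print(compatible(t1, t2), left, right)
```

Because the two tuples disagree on "slacks", the result depends on which tuple comes first, which is why the order of constraint selection can matter.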

Definition 7. Consider a tuple $t_i$ as a par-consistent instantiation of
variables in $V_{t_i}$. An extension of $t_i$ to variables in $V_{t_i} \cup V_{t_j}$ is a tuple $t_{ij}$
where $t_{ij}$ is an instantiation to variables in $V_{t_i} \cup V_{t_j}$. The tuple $t_{ij}$
obtained as the directional join of $t_i$ and $t_j$ is defined to be a
par-consistent extension of $t_i$ if the number of violations of $t_{ij}$ w.r.t. the
constraints in $CES_{t_{ij}}$ is less than $B_{necc}$.

Distance corresponds to the number of violated constraints carried by the
partial solution obtained at any given stage. BestSolution contains the best
solution that has been found so far. The algorithm can therefore be stopped at
any time and return the best result found until then. Hence it can be used as
an anytime algorithm when computational resources for the program are limited.
When $B_{suff}$ is set to 0, the algorithm becomes similar to the previous CDBT
algorithm for regular CSPs.
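The branch and bound scheme can be sketched as follows. This is a simplified reading of CDBB rather than the authors' implementation: the cover is fixed in advance, the tuples of each cover constraint are tried in order, and the names `cdbb`, `b_necc`, `b_suff` and the sample relations are assumptions made for illustration.

```python
import math


def cdbb(cover, scopes, relations, b_necc=math.inf, b_suff=0):
    """Branch and bound over the constraints in `cover`.

    Returns (best assignment or None, number of violated constraints).
    Only solutions violating fewer than b_necc constraints are accepted;
    the search stops early once a solution violates at most b_suff.
    """
    best = {"solution": None, "distance": b_necc}

    def violated(assignment, name):
        return tuple(assignment[v] for v in scopes[name]) not in relations[name]

    def search(partial, distance, depth):
        if depth == len(cover):                 # every variable is covered
            best["solution"], best["distance"] = partial, distance
            return distance <= b_suff           # True means "finished"
        name = cover[depth]
        for values in relations[name]:
            candidate = dict(zip(scopes[name], values))
            new = {**candidate, **partial}      # directional join: earlier values win
            newly = [c for c in scopes
                     if set(scopes[c]) <= set(new)
                     and not set(scopes[c]) <= set(partial)]
            new_distance = distance + sum(violated(new, c) for c in newly)
            # Prune any extension that cannot beat the best solution so far.
            if new_distance < best["distance"] and search(new, new_distance, depth + 1):
                return True
        return False

    search({}, 0, 0)
    return best["solution"], best["distance"]


# Hypothetical over-constrained CSP: no assignment satisfies all three constraints.
scopes = {"C1": ("shirt", "slacks"), "C2": ("shoes", "slacks"), "C3": ("shoes", "shirt")}
relations = {
    "C1": [("Green", "Dress Gray"), ("White", "Dress Blue"), ("White", "Denims")],
    "C2": [("Cordovans", "Dress Gray"), ("Sneakers", "Denims")],
    "C3": [("Cordovans", "White")],
}
solution, distance = cdbb(["C1", "C2"], scopes, relations)
print(solution, distance)
```

The `best` record plays the role of BestSolution and $B_{necc}$: it is updated whenever a complete assignment is reached, so interrupting the search at any point still leaves the best result found so far available.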

5.1 Example Revisited 

In this section, we illustrate the execution of the branch and bound algorithms
on the CSP described earlier in the paper. Figure 5 shows the execution of
backtracking extended with branch and bound for partial constraint satisfaction
as described in [7]. Again the X depicts a dead-end in search, and causes the
algorithm to backtrack. The point to be noted is that the algorithm backtracks
when the number of inconsistencies rises above that of the best complete
solution that has been recorded so far. BTBB visits 14 nodes in all, performs
15 constraint checks and finds a solution that violates 1 constraint as the best
solution in terms of least constraint violations. As before, search concludes
when a satisfactory solution is found, or when all choices for future values for
variables are exhausted. Nodes with a dashed circle around them denote
inconsistent nodes whose level of inconsistency is acceptable.

Fig. 5. BT with branch and bound and CDBB on Robot Clothing

In Figure 5 we also show the execution of CDBB on the same example. As before,
CDBB selects constraints $C_1$ and $C_2$ to form the cover. It visits 4 nodes in all
and performs 2 constraint checks. As before, if a different set of constraints is
selected to form the cover, CDBB visits a different number of nodes.
If $C_1$ and $C_3$ are selected, CDBB visits 3 nodes in all, and if
$C_2$ and $C_3$ are selected it visits 6 nodes in all. The algorithm finds a similar
solution that violates 1 constraint.

6 Analysis of CDBB 

If $C_{cover}$ is a minimal constraint cover, $|C_{cover}| \le |V|$. This can be seen from the
fact that while selecting constraints to form the cover, each additional constraint
covers at least one new variable, and hence all that is required is a cover of size
at most $|V|$. Although the size of a minimal constraint cover is upper bounded
by $|V|$, in practice in CSPs of higher arities this number is even smaller. In fact, if
$C_{cover}$ is a minimal constraint cover of a CSP of arity $k$ with $n = |V|$ variables,
$|C_{cover}| \le n-k+1$. This again follows from the fact that a CSP of arity $k$ has at
least one constraint of arity $k$. Including this constraint in a minimal
constraint cover $C_{cover}$ covers $k$ variables. This leaves $n-k$ variables to be
covered. In the worst case, $n-k$ constraints are required to cover these
variables. So the total number of constraints in a minimal constraint cover is
$\le n-k+1$. Clever selection of a minimal constraint cover can improve the search
complexity of the CDBB algorithm.

The CDBB (as also CDBT) algorithm assumes that the constraints of the 
CSP are given in extensional form. But it should be noted that the only con- 
straints that need to be given in extensional form are the ones that are selected 
to form a minimal constraint cover. This is because all the other constraints are 
only “checked” against, and the individual tuples in those constraints are not 
required to be enumerated. As mentioned above, the number of constraints in a
minimal constraint cover is at most $|V|$. In a binary CSP with $|V|$ variables the
number of constraints can be on the order of $|V|^2$. (For k-ary constraints this number
is even higher.) But CDBB (and CDBT) only need to consider at most $|V|$ of
these constraints as extensional. This also limits the depth of the backtrack tree 
of CDBB. 

To discuss the performance of CDBB, and to show its soundness we will need 
a few lemmas. 

Lemma 1. If the level of inconsistency of a node is $n_{inc}$, then the level of
inconsistency of all of its child nodes is $\ge n_{inc}$.

Proof: The proof follows from Definitions 6 and 7. Extensions cannot remove
inconsistencies. □

Lemma 2. At every level, any constraint that is satisfied at this level, will re- 
main satisfied by all of its child nodes. 

Proof: The algorithm attempts to construct a par-consistent extension of a
tuple $tup_i$. $tup_i$ is a partial assignment on a variable set $V_{tup_i}$. Consider an
extension to this tuple, $tup_{new}$, obtained as the directional join of $tup_i$ and
$tup_{i+1}$. This extension is an assignment on a variable set
$V_{tup_{new}} = V_{tup_i} \cup V_{tup_{i+1}}$. By definition of the par-consistent extension,
for all variables in $V_{tup_i} \cap V_{tup_{i+1}}$,
$tup_{new}[V_{tup_i} \cap V_{tup_{i+1}}] = tup_i[V_{tup_i} \cap V_{tup_{i+1}}]$. Also
$\forall v_k \in V_{tup_{new}} \mid v_k \notin V_{tup_i} \cap V_{tup_{i+1}}$,
$(v_k \in V_{tup_i} \Rightarrow tup_{new}[v_k] = tup_i[v_k]) \wedge (v_k \in V_{tup_{i+1}} \Rightarrow tup_{new}[v_k] = tup_{i+1}[v_k])$,
i.e., all previous assignments to variables remain unchanged, and hence all
constraints that were previously satisfied remain satisfied in the
par-consistent extension. □

Definition 8. A solution to a PCSP is defined as an n-ary tuple whose
inconsistency count is $< B_{necc}$. Equivalently, a solution to a PCSP is an n-ary
tuple that satisfies $> |C| - B_{necc}$ constraints, where $|C|$ is the number of
constraints in the PCSP.

Theorem 1. CDBB is sound. 

Proof: If CDBB outputs a complete assignment, then every inner node, along 
the path from the root to the node most recently visited, has an acceptable 
inconsistency count. (This follows from the previous two lemmas). Also if the 
parent node has an acceptable inconsistency count, the par-consistent extension 
of it will be output only if it also has an acceptable inconsistency count. Since 
this is a complete assignment, it is a solution of the PCSP. Hence the proof. □ 

7 Results 

Three hundred random CSPs were generated using a probability of inclusion 
model of generation as described in [7]. Problems had 20 to 1000 variables; the 
maximum arity varied between 2 and 4; the domain size varied between 2 and 
10; the number of constraints varied between 40 and 4500; and the tightness of 
each constraint varied between 0.3 and 0.7. 



[Figure: two panels of anytime curves, one for |V| = 20-25, |C| = 40-50 and one
for |V| = 1000, |C| = 4500; x-axis: Time]

Fig. 6. CDBB v/s BTBB (sample size=100 and sample size=20)



Each algorithm (BTBB and CDBB) was executed for a fixed period on each
problem. The number of violations in the best solution found so far was recorded
at fixed intervals. These recorded values were averaged across a number of
problems in the same class. The results are displayed as anytime curves in
Figure 6. The graphs indicate the quality of solutions found (on the y-axis,
labeled as violations) and the time taken in seconds (on the x-axis). Each point
on a curve gives the quality of the best solution obtained after the given
interval. Similar results were obtained for other problems (not shown). For all
the problem sets, CDBB consistently found better solutions (i.e., solutions with
fewer violations) more quickly than BTBB.
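The measurement procedure described above can be sketched as follows. This is an illustrative sketch, not the authors' experimental harness: the function names, the representation of a run as a list of (time, violations) improvements, and the sample numbers are all assumptions.

```python
def sample_best_so_far(improvements, interval, horizon):
    """Best violation count known at each sampling instant.

    improvements is a list of (time, violations) pairs, one for each time
    a run found a better solution. The result has one entry per sampling
    instant, or None where no solution had been found yet.
    """
    samples = []
    steps = int(horizon / interval)
    for k in range(1, steps + 1):
        t = k * interval
        known = [v for (when, v) in improvements if when <= t]
        samples.append(min(known) if known else None)
    return samples


def average_curves(curves):
    """Point-by-point average of sampled curves, ignoring missing samples."""
    averaged = []
    for column in zip(*curves):
        values = [v for v in column if v is not None]
        averaged.append(sum(values) / len(values) if values else None)
    return averaged


# Made-up improvement logs for two problems of the same class.
run_a = [(0.5, 7), (1.2, 5), (3.0, 4)]
run_b = [(0.8, 9), (2.5, 6)]
curves = [sample_best_so_far(r, interval=1.0, horizon=3.0) for r in (run_a, run_b)]
anytime_curve = average_curves(curves)
print(curves, anytime_curve)
```

Plotting `anytime_curve` against the sampling instants gives one curve per algorithm, with lower values indicating better solutions at that point in time.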

8 Discussion 

The CDBB algorithm is used for maximal constraint satisfaction and can also
return partial solutions in an anytime fashion. In this paper we deal with the
case where all constraints are considered soft; work on incorporating
hierarchies of constraint importance is currently underway. We describe
an algorithm that is able to find a solution that violates as few constraints as
possible using a branch and bound strategy. The algorithm has been implemented
and tested on moderately sized problems (e.g., 20-1000 variables, 40-4500
constraints with arity 2-5 and domain sizes 2-10).

We compare the execution of the CDBB algorithm with that of the BTBB
algorithm. Some may consider BT to be a straw-man, but BT (or its extension
BTBB) is the appropriate algorithm to compare against CDBT (or its extension
CDBB). Just as BT is the base algorithm of a family of algorithms (BJ, CBJ)
using the primal graph, CDBT is the base of another family of algorithms
(CDBJ, CDCBJ) using the dual graph. We also intend to conduct similar
comparisons between the other corresponding members of these two families,
i.e., BJBB v/s CDBJBB, etc.

BT is normally enhanced with constraint propagation techniques like AC, MAC
and FC. Similarly, CDBT can be enhanced with constraint propagation
(w-consistency) techniques in the constraint directed scheme. Consistency in
binary CSPs is well studied and many of the algorithms have been generalised
to general CSPs too, but many of these generalisations lose some functionality.
Often enforcing such consistencies is not possible (intractable or too
expensive) or does not perform well. On the other hand, w-consistency is not an
extension of arc consistency, and therefore is not the same as arc consistency
when applied to binary CSPs.

This paper presents an extension to CDBT to handle the over-constrained-ness
of problems. It can be taken further by incorporating w-consistency enforcement
as part of the search procedure. CDBT can also be extended by intelligent
backtracking schemes like BJ and CBJ. This paper attempts to provide the
motivation for algorithms that are based on the dual graph for handling
over-constrained CSPs. It is also to be noted that heuristics for value ordering
can be applied to the tuples in the domains of the various constraints, and this
will aid in specifying the preference of one tuple over another in an
instantiation of variables.

9 Conclusions and Future Directions

In this paper we have provided a branch and bound extension to CDBT which 
is applicable to over-constrained problems. We have shown that the algorithm is 
sound, and that its performance is competitive with existing algorithms of the 
same kind. Specifically, our results indicate that 

1. CDBB consistently provides better solutions and finds these by exploring
fewer nodes in the search space.

2. The CDBB algorithm can be used as an anytime algorithm to provide partial 
solutions almost immediately. 

3. Since the algorithm is based on constraint selections, preferences between
constraints (hierarchies, soft versus hard constraints) can be reflected in
the ordering of these constraints in the selection. Constraint hierarchies can
also be dealt with by assigning a cost to the violation of a constraint, and
then replacing $B_{necc}$ with a cost function for branch and bound. Preferences
between tuples in a constraint can be incorporated into the search as value
ordering heuristics directly.

It will be interesting to investigate intelligent backtracking schemes like
backjumping, conflict directed backjumping and dynamic backtracking based on the
constraint dual graph. CDBB can also be applied to real world over-constrained
domains and its performance compared with successful local search methods.

Abstract. Automatically extracting keyphrases from documents is a task with
many applications in information retrieval and natural language processing. 
Document retrieval can be biased towards documents containing relevant 
keyphrases; documents can be classified or categorized based on their 
keyphrases; automatic text summarization may extract sentences with high 
keyphrase scores. 

This paper describes a simple system for choosing noun phrases from a 
document as keyphrases. A noun phrase is chosen based on its length, its 
frequency and the frequency of its head noun. Noun phrases are extracted from 
a text using a base noun phrase skimmer and an off-the-shelf online dictionary. 

Experiments involving human judges reveal several interesting results: the 
simple noun phrase-based system performs roughly as well as a state-of-the-art, 
corpus-trained keyphrase extractor; ratings for individual keyphrases do not 
necessarily correlate with ratings for sets of keyphrases for a document; 
agreement among unbiased judges on the keyphrase rating task is poor. 



1 Introduction 

Keyphrases for a document are useful for many applications. For text retrieval 
keyphrases can help narrow search results or rank retrieved documents. They can be 
used to cluster semantically related documents for the purposes of categorization. 
They can also be used to guide automatic text summarization. 

In our Knowledge Acquisition and Machine Learning group we have been 
working on a system to generate summaries of documents automatically using a 
modular design. The modular design divides the summarization task into
several parts: keyphrase extraction, text segmentation, segment classification,
sentence scoring and selection, etc. For each part, any one of several systems could be 
plugged in. Furthermore, each module has parameters that could be set empirically. 
The intent in the project is to use machine learning to configure the system (select 
modules and set parameter values) to produce the “best” extracted summaries. 

To increase flexibility in the configurability of the system, we would ideally have 
a number of different modules that could be plugged in at the appropriate points in the 
greater system. For keyphrase extraction, we are using Peter Turney’s Extractor
[11]. As an alternative we decided to build a simple keyphrase extractor in-house as
well. The goal was to keep the extractor simple and to apply any linguistic insight we
might have to the process.

This paper presents our simple keyphrase extractor (herein referred to as B&C for
lack of a better name). Section 3 describes how B&C extracts noun phrases from a
document and scores them based on their frequency and length, taking into account
the frequency of noun phrase heads. We have conducted experiments to compare
B&C to Extractor on the level of individual keyphrases as well as on the level of
complete, coherent keyphrase sets. The experiments (using human judges) and results
are described in sections 4 and 5. Unfortunately, the costs involved in experiments
involving human judges limit the scope of experimental evaluation. Our experiments,
therefore, are restricted to a comparison of B&C and Extractor on a small number of
documents. Comparisons of Extractor to other keyphrase extraction systems can be
found in [10]. As usual, many other experiments can be imagined and should be
carried out (see section 7).

The experiments suggest that B&C and Extractor perform differently, but about 
equally well. Our judges preferred individual keyphrases from Extractor more often 
but complete sets from B&C more often. A low degree of agreement between judges 
prevents sweeping conclusions about the superiority of one system over the other. 



2 Related Work 

Krulwich & Burkey extract “semantically significant phrases” from documents 
based on the documents’ structural and superficial features. A phrase is some small 
number of words (one to five, for example). Phrases are chosen using several 
heuristics. For example, phrases occurring in section headers are candidate significant 
phrases, as are phrases that are formatted differently than surrounding text. The 
purpose of extracting such phrases is to attempt to determine a user’s interests for 
information retrieval automatically. 

Turney’s Extractor extracts a small number of keyphrases from documents. 
Relevant keyphrases are chosen from a list of candidate phrases: all sequences of a 
small number of words (up to about five) with no intervening stop words or 
punctuation. The stop word list consists of closed category words (prepositions, 
pronouns, conjunctions, articles, etc.) as well as a few very general open category 
words (verbs, nouns, etc.). Keyphrases are selected by scoring candidate phrases on a 
number of features (such as frequency of the stemmed words in the phrase, length of 
the phrase, position of the phrase in the document, etc.). Features likely to produce 
keyphrases that match authors’ keyphrases for a document were determined 
automatically using a genetic algorithm. Although Extractor has been evaluated 
primarily by comparing extracted keyphrases to authors’ keyphrases, it has recently 
been evaluated by human judges as well. The web version of Extractor produces a set 
of keyphrases and the user is invited to mark each keyphrase as “good” or “bad”. 
Results so far give 62% “good” phrases, 18% “bad” and 20% “no opinion”. 

The Kea system [12] uses two features to determine if a candidate phrase is a good
keyphrase. Candidate phrases are sequences of consecutive words (usually no more
than three) with no intervening phrase boundary indicators (such as punctuation).
Proper names and phrases beginning or ending with stop words are excluded.
Subphrases of a candidate phrase may appear as separate candidate phrases. The first
feature used in selecting keyphrases is TF×IDF (term frequency × inverse document
frequency), which favours phrases that appear frequently in the current document and
infrequently in general usage. Frequency in general usage is determined by frequency
in a “global corpus” (a large, general purpose corpus). The second feature is distance
from the beginning of the document. Feature values are calculated for all candidate
phrases in documents in a training corpus (documents for which authors’ keyphrases
are available). Each candidate phrase is then marked as a positive example if it is
among the author’s keyphrases or as a negative example otherwise. A Naive Bayes
technique is used to assign weights to the features, based on the feature values of the
positive and negative examples. Experiments on unseen documents compare extracted
keyphrases to authors’ keyphrases. The performance is statistically equivalent to
Extractor. The Kea group recognizes the limitations of evaluating keyphrases relative
to author-supplied keyphrases and plans to do further evaluation using human judges
to rate “how well a set of extracted keyphrases summarize a particular document” ([12]).



3 Extracting Keyphrases 

Our system for extracting keyphrases from documents proceeds in three steps: it 
skims a document for base noun phrases; it assigns scores to noun phrases based on 
frequency and length; it filters some noise from the set of top scoring keyphrases. 



3.1 Skimming for Base Noun Phrases 

Most of our work in knowledge acquisition from texts processes parse trees generated
by the DIPETT parser. For the task of extracting keyphrases, full, detailed parses of
complete English sentences are not needed. To avoid the overhead associated with 
deep parsing, we decided to implement a simple base noun phrase skimmer instead. 

A base noun phrase is a non-recursive structure consisting of a head noun and zero 
or more premodifying adjectives and/or nouns. The base noun phrase does not include 
noun phrase postmodifiers such as prepositional phrases or relative clauses. A base 
noun phrase skimmer proceeds through a text word-by-word looking for sequences of
nouns and adjectives ending with a noun and surrounded by non-noun/adjectives.¹

Such a skimmer requires knowledge of the parts of speech of the words in the text. 
One possibility would be to tag the text using a tagger (such as the widely used Brill
tagger). A tagger assigns the most likely single part of speech tag (noun, adjective,
verb, etc.) to each word in a sentence. We decided to use a simple dictionary lookup 
instead. The main advantage of a dictionary lookup is that our online dictionaries list 
the root form of each word, allowing us to treat such phrases as good schema and 
better schemata as instances of the same root phrase. 

The skimmer uses two dictionaries: our own DIPETT dictionary, which is fairly 
complete for closed class words (articles, prepositions, conjunctions, etc.); and the 
Collins wordlist, a large list of English words with all possible parts of speech for 
each word (and then some). If a word appears in DIPETT’s dictionary as a closed
category word, it is tagged <closed>. Otherwise, if the word can be a noun according
to Collins, it is tagged <noun>; if it can be an adjective it is tagged <adjective>. The
check in DIPETT’s dictionary is required since the Collins list contains some
questionable entries (such as “a” as a preposition, noun and article).

¹ More sophistication is possible by looking specifically for noun phrase “surrounders” such
as articles, prepositions, verbs, etc., or by allowing other elements in the base noun phrase
such as possessives, conjoined premodifiers, etc.
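
The skimming procedure of this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two small lexicons are hypothetical stand-ins for the DIPETT closed-class dictionary and the Collins wordlist, root-form lookup is omitted, and the choice to test for a noun reading before an adjective reading is one possible reading of the tagging rule above.

```python
import re

# Hypothetical toy lexicons standing in for the DIPETT closed-class
# dictionary and the Collins wordlist (assumptions for illustration only).
CLOSED = {"the", "a", "an", "of", "in", "and", "is", "was", "by"}
COLLINS = {
    "canadian": {"adjective", "noun"},
    "space": {"noun", "verb"},
    "agency": {"noun"},
    "new": {"adjective"},
    "launched": {"verb"},
    "satellite": {"noun"},
}


def tag(word):
    """Tag one word as closed, noun, adjective or other."""
    w = word.lower()
    if w in CLOSED:
        return "closed"
    pos = COLLINS.get(w, set())
    if "noun" in pos:
        return "noun"
    if "adjective" in pos:
        return "adjective"
    return "other"


def skim_base_nps(text):
    """Return runs of nouns/adjectives that end in a noun."""
    phrases, run = [], []
    # The final "." is a sentinel that flushes the last run.
    for word in re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text) + ["."]:
        t = tag(word)
        if t in ("noun", "adjective"):
            run.append((word.lower(), t))
            continue
        # Any other word closes the current run; trailing adjectives
        # are trimmed so that the phrase ends with a noun.
        while run and run[-1][1] != "noun":
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


print(skim_base_nps("The Canadian Space Agency launched a new satellite."))
```

With these toy lexicons the sentence yields two base noun phrases, one headed by "agency" and one headed by "satellite".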



3.2 Counting Noun Phrases 

In this section we describe the formula for choosing noun phrases (NPs) as 
keyphrases. Systems that choose keyphrases on frequency alone take the most 
frequently occurring phrases in a document. Our decision to take the frequency of a 
noun phrase’s head noun into consideration was based on the following observations: 

1. Longer noun phrases (with more premodifiers) are more specific and may be 
more relevant to a particular document than more general, shorter noun phrases. 

2. In the interest of economy (and ease on the reader), long noun phrases are usually 
not repeated frequently in a document. For example, an article about the 
Canadian Space Agency may use that phrase once, with subsequent references 
reduced to the Space Agency or even the Agency. 

Here is our algorithm for assigning scores to noun phrases: 

1. freq_H = the number of times noun H appears in the document as the head of a
noun phrase

2. take the top N heads with the highest freq_H; discard the rest of the heads

3. for each head H_i ∈ H_1..H_N

i) recover all complete noun phrases NP_1..NP_m having H_i as head

ii) for each NP_j ∈ NP_1..NP_m calculate NP_j’s score as its frequency times its
length (in words)

4. keep the top K highest scoring noun phrases as keyphrase candidates for the
document

In steps 1 and 2, discarding relatively infrequently occurring heads allows less
frequent noun phrases (with frequent heads) to compete in steps 3 and 4. For example,
head H_1 may occur more frequently than any of the complete noun phrases having H_2
as head. But if H_2 occurs as head more frequently than H_1, H_1 may be discarded in
favour of H_2’s noun phrases.
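
The four steps above can be sketched in Python as follows. This is an illustrative reading, not the B&C code: the function name and its inputs are hypothetical, each phrase is assumed to be already reduced to root form, the head is assumed to be the last word of the phrase, and ties are broken alphabetically (the paper does not say how ties are handled).

```python
from collections import Counter


def score_noun_phrases(noun_phrases, n_heads, k_phrases):
    """Score base noun phrases following steps 1 to 4.

    noun_phrases holds one string per occurrence in the document.
    """
    # Step 1: frequency of each noun as the head of a noun phrase.
    head_freq = Counter(phrase.split()[-1] for phrase in noun_phrases)
    # Step 2: keep only the N most frequent heads.
    top_heads = {head for head, _ in head_freq.most_common(n_heads)}
    # Step 3: score each complete phrase with a kept head as
    # frequency times length in words.
    phrase_freq = Counter(noun_phrases)
    scores = {
        phrase: freq * len(phrase.split())
        for phrase, freq in phrase_freq.items()
        if phrase.split()[-1] in top_heads
    }
    # Step 4: keep the K highest scoring phrases (ties: alphabetical).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k_phrases]


occurrences = [
    "canadian space agency", "space agency", "agency", "agency",
    "laser printer", "laser printer", "colour printer", "budget",
]
print(score_noun_phrases(occurrences, n_heads=2, k_phrases=3))
```

In this made-up example the long phrase "canadian space agency" occurs only once, yet it survives because its head "agency" is frequent, which is the effect the two observations above were meant to capture.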

The algorithm allows for many variations, some of which we considered. For 
example, in step 3 we considered taking exactly one NP (the top scoring NP) for each 
of the N most frequent heads, disallowing more than one keyphrase with the same 
head. We decided, however, to allow multiple keyphrases with the same head. One 
can imagine documents for which laser printer and colour printer would both be 
useful keyphrases. Such biases should be experimentally validated. 

The thresholds N and K should be set according to heuristics (based on document 
length or as a percentage of distinct heads), or set by the user as a parameter, or 
determined empirically. For example, for all the noun phrases in step 3, if there is a 
gap in the scores between the higher scoring and lower scoring NPs, the threshold 
could be set at the gap. These thresholds could also be set according to results of 
evaluations: is there a threshold beyond which keyphrases rate poorly? For the
experiments we set N and K arbitrarily to the maximum number of keyphrases
produced by Extractor for our test documents (twelve). Setting N and K higher would 
produce more low-scoring keyphrases (i.e., shorter phrases and those with less 
frequent heads). 
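
One of the threshold heuristics mentioned above, cutting the list at a gap in the scores, might look like the following Python sketch. Choosing the single largest drop between neighbouring scores is an assumption made here for illustration; the text does not specify how the gap would be identified, and the function name is hypothetical.

```python
def cut_at_largest_gap(scores):
    """Return the scores that lie above the largest drop between neighbours."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return ranked
    # Difference between each score and the next lower one.
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1  # number of scores kept
    return ranked[:cut]


print(cut_at_largest_gap([12, 10, 9, 3, 2, 2]))
```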

3.3 Postprocessing 

Once the algorithm has produced the top K keyphrase candidates for a document, we 
apply two simple postprocessing filters: remove single letter keyphrases; remove 
wholly-contained subphrases. 

Single letter keyphrases are an artifact of the Collins dictionary lookup, and could 
be filtered out prior to keyphrase selection. Normally one might consider ignoring 
single letter words altogether (as Extractor does). Previous investigations into the 
semantics of noun phrases, however, suggest that some single letter words are
relevant in noun phrases (SCSI D connector, Y chromosome, John F. Kennedy, etc.).²

Removing wholly-contained subphrases is intended to prevent both a phrase and a 
generalization of the phrase (a subphrase) from appearing as keyphrases when both 
have high scores (e.g., theoretical Computer Science and Computer Science). It is 
easy to invent examples where both a phrase and a wholly-contained subphrase would 
make good keyphrases for a document. But in general, given a coherent set of 
keyphrases, we decided that subphrases would contribute little. This decision could be 
added to the growing list of choices to be validated experimentally. 
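
A minimal Python sketch of the two postprocessing filters follows. The function names are hypothetical and the keyphrases are assumed to be lower-cased, space-separated strings; a subphrase is taken to be a contiguous word sequence inside another keyphrase.

```python
def is_subphrase(short, long):
    """True if the word list `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def postprocess(keyphrases):
    """Drop single letter keyphrases and wholly-contained subphrases."""
    # Filter 1: a keyphrase consisting of one letter is removed; single
    # letters inside longer phrases (e.g. "y chromosome") are kept.
    kept = [kp for kp in keyphrases if len(kp) > 1]
    # Filter 2: remove any phrase wholly contained in another kept phrase.
    result = []
    for kp in kept:
        words = kp.split()
        contained = any(
            other != kp and is_subphrase(words, other.split())
            for other in kept
        )
        if not contained:
            result.append(kp)
    return result


print(postprocess([
    "theoretical computer science", "computer science",
    "y", "y chromosome", "laser printer",
]))
```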



4 Experiments 

4.1 Using Human Judges 

We conducted two experiments using human judges to compare our keyphrases to 
those produced by Extractor. Our previous experiences using human judges have 
taught us that using human judges should be avoided. Making the necessary 
judgments is usually a difficult, time-consuming and energy-consuming process. 
Drawing statistically significant conclusions from such experiments can be difficult 
because there is a limit on the amount of data that can be collected. Nonetheless, 
automatic evaluation of keyphrases would require some gold standard set of 
keyphrases for a document, and these simply do not exist. Other researchers 
have used author’s keyphrases as a gold standard. There are several problems with 
using author’s keyphrases: 

• author’s keyphrases are not always taken from the text (in experiments reported
in [11], 75% of them are)

• author’s keyphrases are often restricted to a very small number of phrases (two or
three, for example)

• author’s keyphrases are often chosen for a specific purpose (for classification
according to an existing set of keyphrases, to steer review of a document, to
distinguish a document from others in one specific collection of documents, etc.)

• author’s keyphrases are usually only available for the very few kinds of
documents for which authors supply keyphrases; these are exactly the kinds of
documents for which we have no need for automatically generated keyphrases.

² Some writers may have a preference for D-connector or Y-chromosome, but hyphenating is
far from universal and cannot be assumed by systems dealing with unrestricted text.

The judges in our experiments were university faculty, postdoctoral fellows, graduate 
students and AI researchers. It is possible that this community has preconceptions 
about what makes a good keyphrase. 

4.2 Choosing Documents 

Having decided to enlist human judges in our evaluation, we had to choose a small 
number of documents. To ensure a fair comparison to Extractor, we chose nine 
documents from corpora used in training and testing Extractor. The nine consisted of 
three documents chosen at random from each of three of the five Extractor corpora. 
One of these corpora was used in training Extractor; all three were used in testing it.
To the nine Extractor documents we added four documents from different domains 
with no particular consideration given to subject matter or style. 

For each of the thirteen documents, we extracted keyphrases using Extractor and 
B&C. The documents were then given to twelve judges who were asked to read them. 
No further instructions about what to look for in the documents were given, though 
the judges knew that they would have to rate keyphrases for them. 

After reading the documents, the judges were asked to rate keyphrases in two 
separate experiments: one to rate individual keyphrases for each document, and one to 
compare Extractor’s complete set of keyphrases for each document to B&C’s set of 
keyphrases for each document. 

4.3 Rating Individual Keyphrases 

For each document the judges were given a single list of keyphrases in no particular 
order. The list was the union of keyphrases from Extractor and B&C keyphrases 
(duplicates removed). Judges rated each keyphrase as “good”, “so-so” or “bad”, with 
minimal instructions about the definitions of those terms (to avoid biasing them 
toward a particular kind of keyphrase). 

4.4 Comparing Sets of Keyphrases 

The second experiment had the judges compare Extractor’s keyphrases to B&C 
keyphrases for each document. The two keyphrase sets were normalized (converting 
all characters to lower case, for example) and presented to the judges in random order. 

The judges were instructed to consider each of the two keyphrase sets as a 
coherent whole and to compare them to each other. They were told to mark as 
preferred the set that they felt better represented the content of the document for any 
reason. They were also given the option to mark neither as preferred if they felt that 
there was no significant difference between the two. 
5 Evaluation

Here we present the results of the experiments from three points of view: the straight
numbers for the two experiments; the degree to which the human judges agreed
with each other for each experiment; the correlation between the two experiments.
We leave discussion of the results to section 6.



5.1 Individual Keyphrase Ratings 

For each keyphrase, judges’ ratings were converted to numeric scores by assigning 2
points for a “good” rating, 1 point for a “so-so” rating and 0 points for a “bad” rating.
The score for each keyphrase was calculated simply as the sum of the scores from all
twelve judges. We then assigned a score to Extractor and B&C for each document by
taking the sum of the keyphrase scores for keyphrases produced by the system divided
by the total number of keyphrases produced by the system for the document. The
normalized results appear in Table 1.
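
The scoring just described can be sketched in Python as below. The function names and data layout are hypothetical. The text does not state how the scores were normalized; dividing by the maximum possible keyphrase score (2 points times the number of judges) is an assumption made here so that results fall between 0 and 1.

```python
POINTS = {"good": 2, "so-so": 1, "bad": 0}


def keyphrase_score(ratings):
    """Sum of the points given to one keyphrase by all judges."""
    return sum(POINTS[r] for r in ratings)


def document_score(system_keyphrases, ratings_by_keyphrase, normalize=True):
    """Average keyphrase score for one system on one document."""
    total = sum(
        keyphrase_score(ratings_by_keyphrase[kp]) for kp in system_keyphrases
    )
    score = total / len(system_keyphrases)
    if normalize:
        # Assumed normalization: divide by the best possible score,
        # i.e. every judge rating the keyphrase "good".
        n_judges = len(next(iter(ratings_by_keyphrase.values())))
        score /= 2 * n_judges
    return score


# Made-up ratings from three judges for two keyphrases.
ratings = {
    "space agency": ["good", "good", "bad"],
    "budget": ["so-so", "bad", "bad"],
}
print(document_score(["space agency", "budget"], ratings))
```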

Table 1. Document scores based on individual keyphrase ratings

             Average document score    Standard deviation
Extractor    0.56                      0.11
B&C          0.47                      0.10



On average, Extractor produced 6.2 keyphrases per document and B&C produced 9.1.
Forcing B&C to produce more keyphrases may explain the lower average document
score, assuming the extra keyphrases were the ones rated lower by the judges. The
average length of keyphrases was 1.7 words for Extractor and 1.9 words for B&C.
Fig. 1 shows the average proportion of keyphrases of varying length per document.
The figure clearly illustrates B&C’s bias toward longer phrases.




[Figure omitted; horizontal axis: keyphrase length, 1 to 5]

Fig. 1. Average proportion of keyphrases of various lengths (per document)
5.2 Side-by-Side Comparison of Keyphrase Sets

For the side-by-side comparison experiment we counted the number of times judges
preferred Extractor’s set of keyphrases, the number of times judges preferred B&C
sets and the number of times neither set was preferred. The results appear in Table 2.

Table 2. Number of times complete sets of keyphrases were preferred

             Times preferred
Extractor    61/156 (0.39)
B&C          74/156 (0.47)
neither      21/156 (0.13)



Judges preferred the longer of the two sets of keyphrases 39% of the time and the 
shorter set 40% of the time. For one document, both Extractor and B&C produced the 
same number of keyphrases. 

5.3 Inter-Judge Agreement 

In any experiment involving human judgments there must be some analysis of the 
degree to which the judges agree. 

For the side-by-side comparison judges agreed on their preferences 43% of the 
time. Of course, we would expect them to agree some of the time by chance alone. To 
correct for chance, we measured the inter-judge agreement using the Kappa Statistic,
which is widely used in the field of content analysis and growing in popularity in
the field of natural language processing. Briefly, the Kappa Statistic is a measure of
agreement between two judges that takes into account chance. Kappa is defined as:

\[ \kappa = \frac{P(A) - P(E)}{1 - P(E)} \]

P(A) is the number of times the two judges agree relative to the total number of 
judgments; P(E) is the proportion of times the judges are expected to agree by chance. 
κ = 0 indicates chance agreement. Notice that the definition allows for negative κ
values when judges agree less often than would be expected by chance. Normally, 
P(E) would be set to the number of combinations of identical judgments (agreements) 
divided by the total number of combinations of judgments (in our case, 1/3): 



judge 1       Ex’tor  Ex’tor  Ex’tor   B&C     B&C     B&C      neither  neither  neither
judge 2       Ex’tor  B&C     neither  Ex’tor  B&C     neither  Ex’tor   B&C      neither
probability   1/9     1/9     1/9      1/9     1/9     1/9      1/9      1/9      1/9
agreement     1/9      +               1/9      +                                 1/9

P(E) = 1/9 + 1/9 + 1/9 = 1/3



This calculation is based on the assumption that a judge is equally likely to choose 
Extractor, B&C or neither. In fact, there was definite reticence among the judges to 
choose neither. The observed likelihood that a judge would choose neither was about 
0.13, making the probability that a judge would choose Extractor or B&C 0.43 each. 
So our estimate for P(E) is the probability that both judges choose neither (0.13 × 0.13)
plus the probability that both choose Extractor (0.43 × 0.43) plus the probability that
both choose B&C (0.43 × 0.43): P(E) = 0.39. That is, given the judges’ avoidance of
neither, we expect them to agree by chance somewhat more often than 1/3 of the time.
The κ values reported here are lower than if we had used P(E) = 1/3, though not by
much. Table 3 shows the κ values for each pair of judges on the side-by-side
comparison experiment.
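
The two estimates of P(E) and their effect on Kappa can be checked with a short Python sketch. The function names are hypothetical. The overall agreement rate of 43% is used as P(A) purely for illustration; the paper reports averages of pairwise Kappa values, so the figures computed here need not match the reported averages exactly.

```python
def chance_agreement(category_probs):
    """P(E): probability that two independent judges pick the same category."""
    return sum(p * p for p in category_probs)


def kappa(p_agree, p_chance):
    """Kappa = (P(A) - P(E)) / (1 - P(E))."""
    return (p_agree - p_chance) / (1 - p_chance)


# Equal likelihood of Extractor, B&C and neither.
uniform = chance_agreement([1 / 3, 1 / 3, 1 / 3])
# Observed likelihoods as rounded in the text.
observed = chance_agreement([0.43, 0.43, 0.13])

print(round(uniform, 2), round(observed, 2))
print(kappa(0.43, uniform), kappa(0.43, observed))
```

Because the observed P(E) is larger than 1/3, the same raw agreement rate produces a smaller Kappa, as the text states.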

Table 3. The κ values for each pair of judges (κ > 0 in boldface)




The Kappa values are spectacularly low. The average κ is 0.06, meaning that, on
average, the judges agree only about as much as can be expected by chance (κ = 0).³
If we isolate the situations where judges did agree, we can compare the number of
agreements on Extractor keyphrase sets to B&C keyphrase sets. Table 4 shows what
the judges were agreeing on when they agreed.



Table 4. Distribution of agreements among the three categories

             Number of agreements
Extractor    142 (0.39)
B&C          210 (0.57)
neither      14 (0.04)
total        366



For the individual keyphrase rating experiment it is somewhat more difficult to
determine the degree to which judges agreed. Judges agreed on the rating of
individual keyphrases (as “good”, “so-so” or “bad”) 52% of the time. But we would
expect them to agree by chance 1/3 of the time (assuming a single keyphrase is
equally likely to be “good”, “so-so” or “bad”). For individual keyphrases, the average
Kappa is 0.27, which is quite low.

³ Setting P(E) to 1/3 gives an average κ of 0.14, which still indicates very little agreement
beyond what would be expected by chance.

But Kappa is not necessarily a good indicator of agreement for individual 
keyphrase ratings. Given the three-point scale, it is possible that judges rated 
keyphrases for a document similarly, but not identically. For example, one judge 
might rate half the phrases “so-so” and the other half “bad”. A second judge might
rate all of the first judge’s “so-so” keyphrases “good” and all of the first judge’s “bad”
keyphrases “so-so”. These two judges would have the minimum Kappa of -0.5, even
though their relative ratings were similar. Kappa requires absolute agreement of judges.

To account for relative similarities between judges’ individual ratings of
keyphrases, we calculated the correlative coefficient⁴ between the ratings of each pair
of judges. The average coefficient was 0.47, indicating moderate agreement among
judges on the quality of a keyphrase relative to other keyphrases. Individual
coefficients appear in Table 5.
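
The contrast between Kappa and the correlative coefficient can be reproduced for the two hypothetical judges described above. In this Python sketch the correlative coefficient is assumed to be the Pearson correlation coefficient (the paper does not name the formula), P(E) is fixed at 1/3, and all function names are illustrative.

```python
from math import sqrt

POINTS = {"good": 2, "so-so": 1, "bad": 0}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kappa_uniform(a, b, n_categories=3):
    """Kappa with P(E) = 1 / n_categories."""
    p_agree = sum(x == y for x, y in zip(a, b)) / len(a)
    p_chance = 1 / n_categories
    return (p_agree - p_chance) / (1 - p_chance)


# The first judge rates half the phrases "so-so" and half "bad";
# the second judge rates each phrase one step higher.
first = ["so-so", "so-so", "bad", "bad"]
second = ["good", "good", "so-so", "so-so"]

k = kappa_uniform(first, second)
r = pearson([POINTS[x] for x in first], [POINTS[x] for x in second])
print(k, r)
```

The judges never give identical ratings, so Kappa is at its minimum, while the correlation is perfect because the second judge's ratings are a constant shift of the first judge's.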



Table 5. Correlative coefficients for each pair of judges 




5.4 Correlation Between the Two Evaluations 

The purpose of conducting both of the experiments we have described was in part to
investigate the connection between the quality of individual keyphrases and sets of
keyphrases. For example, the phrase alien abduction experience may be considered a
“good” keyphrase for a particular document; the keyphrase experience may be
considered too general and rated “bad”. But if the set of keyphrases containing
experience also contains alien and abduction, the set as a whole might be considered
just as good as the set containing alien abduction experience.

⁴ The correlative coefficient is a measure of the degree to which the differences among data
points in one list are similar to the differences among data points in a second list, even if
the data points are scaled differently in each list. A correlative coefficient of 0 indicates no
relationship between the data in the two lists. Correlative coefficients of 1 and -1 indicate
direct and inverse relationships between the data points in the two lists.

Similarly, a set with many weak keyphrases and many good keyphrases may rate 
poorly in the document score that normalizes for the number of keyphrases in the set. 
But in an application where recall of keyphrases is more important than precision, the 
set might be useful. 



To measure the correspondence between each judge’s individual keyphrase ratings 
and side-by-side preferences, we used individual keyphrase ratings to predict which 
set a judge would prefer. Here is the simple formula to predict if judge J will prefer 
Extractor or B&C (or neither) for document D. As previously, keyphrases rated 
“good” are given 2 points, etc. 

1. For each keyphrase K in D:

   if K is in Extractor’s set, add K’s points from J’s rating to Extractor’s score for D
   if K is in B&C’s set, add K’s points from J’s rating to B&C’s score for D

2. Predict that J will prefer whichever set has the greater score for D (or neither, if
   the two scores are equal)
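
The prediction rule above can be written as a few lines of Python. The function name and the dictionary layout are hypothetical; as in the formula, the two scores are raw sums and are not normalized by the size of each set.

```python
POINTS = {"good": 2, "so-so": 1, "bad": 0}


def predict_preference(extractor_set, bc_set, judge_ratings):
    """Predict which keyphrase set judge J prefers for document D.

    judge_ratings maps each keyphrase to J's rating of it.
    """
    extractor_score = sum(POINTS[judge_ratings[kp]] for kp in extractor_set)
    bc_score = sum(POINTS[judge_ratings[kp]] for kp in bc_set)
    if extractor_score > bc_score:
        return "Extractor"
    if bc_score > extractor_score:
        return "B&C"
    return "neither"


# Made-up ratings by one judge for one document.
ratings = {
    "alien abduction experience": "good",
    "experience": "bad",
    "alien": "so-so",
    "abduction": "so-so",
}
print(predict_preference(
    ["experience", "alien", "abduction"],
    ["alien abduction experience"],
    ratings,
))
```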

We then counted the number of times the prediction for each judge on each document
matched the judge’s actual preference. The proportions of matches are shown in
Table 6. The average was 0.51. Again we would expect that by chance a judge’s individual
keyphrase scores would match that judge’s preferred set some of the time. The
average Kappa measuring the agreement between a judge’s keyphrase-based
document scores and the judge’s stated preference is 0.21.



Table 6. Number of times individual keyphrase ratings predict keyphrase set preference

Judge                          A     B     C     D     E     F     G     H     I     J     K     L
Keyphrase ratings predict
set preference                 0.46  0.69  0.31  0.54  0.31  0.62  0.46  0.54  0.69  0.69  0.62  0.15



6 Discussion 

The judges seemed on average to assign higher scores to individual keyphrases 
produced by Extractor, though not significantly higher scores. Normalizing the 
document scores by dividing by the number of keyphrases produced should correct
for any advantage B&C gained by producing more keyphrases. The fact that judges 
preferred short sets as often as long sets (40% vs. 39%) suggests that having more 
keyphrases was not necessarily an advantage to B&C. 

For the side-by-side comparison of keyphrase sets, judges more often preferred 
B&C keyphrases, despite the fact that Extractor's individual keyphrase ratings were 
higher and the fact that B&C’s keyphrase sets were longer. This can be accounted for
by looking at inter-judge agreement on the side-by-side comparisons; when more 
judges chose B&C keyphrase sets, they chose them as a larger majority; when more 
judges chose Extractor keyphrase sets, there was more disagreement. 

The discrepancy between the results of the two experiments is supported by the 
weak correlation between individual keyphrase-based document scores and keyphrase 
set preferences. Judges did not prefer keyphrase sets based simply on the individual 
keyphrases they contained. A set of keyphrases is somehow more than the sum of its 
individual keyphrases. 



7 Future Considerations 

The B&C system is overly simplistic. The noun phrase skimmer could be improved 
(or we could go back to DIPETT for better noun phrases). System parameters (such as 
the number of heads considered (N) and the number of keyphrases generated (K))
should be set according to empirical observations, perhaps as the result of a machine 
learning experiment. New parameters, such as phrase position in the document, could 
be added. One side-trip experiment showed little difference between keyphrases 
extracted from the whole document and those extracted from the first half only. 

More evaluation is also needed. The low inter-judge agreement (due in part to the 
unconstrained nature of our experiments) suggests that a more directed experiment is 
required; one with a particular application of keyphrases in mind. Other experiments 
are required to evaluate design decisions in isolation (such as the decision to allow 
multiple keyphrases with the same noun phrase head). 

A more ambitious project would be to plug the different keyphrase extractors into 
a larger system. How would different keyphrases affect sentence extraction in a text 
summarization system, for example? It would also be interesting to adjust the 
keyphrase selection algorithm to allow for compound heads; theoretical natural 
language processing and empirical natural language processing are kinds of natural 
language processing, not just kinds of processing. 



8 Conclusions 

In this paper we have presented a simple system for extracting keyphrases 
automatically from documents. It requires no training and makes use of publicly 
available lexical resources only. Despite its lack of sophistication, it appears to 
perform no worse than the state-of-the-art, trained Extractor system in experiments 
involving human judges. 

More importantly, however, experiments show that judges do not necessarily 
consider the quality of sets of keyphrases as a simple function of the quality of 
individual keyphrases. This suggests that neither experiments involving the rating of 
individual keyphrases only (as reported in [11]) nor experiments rating the quality of
sets of keyphrases only (as proposed in [12]) are sufficient for evaluating the
performance of a keyphrase extraction system. 
Abstract. Type hierarchies are important structures used in knowledge
stores to enumerate and classify the entities from a domain of interest.
The hierarchical structure establishes de facto a “similarity space” in
which the elements of a same class are considered close semantically, as
they share the properties of their superclass. An important task in Natural
Language Processing (NLP) is sentence understanding. This task
relies partly on comparing the words in the sentence among each other
as well as to the words in previous sentences and to words in a knowledge
store. A type hierarchy consisting of words and/or word senses can
be useful to facilitate these comparisons and establish which words are
semantically related. The problems of using a type hierarchy for evaluating
semantic distance come from its dependency on the available words
of a specific language, and on the arbitrariness of its classes and of its
depth, which leads to the development of semantic distance measures
giving arbitrary results. We propose a way to extend the type hierarchy,
to give more flexibility to the “similarity space”, by including non-lexical
concepts defined around relations other than taxonomic ones. We also
suggest a method for discovering these non-lexical concepts in texts, and
present some results.



1 Introduction 

A natural language sentence gathers multiple words into a single unit to express 
relationships between these words. In a coherent discourse, the sentence often 
follows another sentence to which it is related. As well, the words used have 
some “default” knowledge associated with them, usually stored and organized 
in a knowledge base (knowledge store). 

Sentence comprehension aims at discovering the relationship of the words 
within the sentence to each other, but also at discovering the relationship of the 
words in the sentence to words in previous sentences and/or in a knowledge store 
within memory. These two tasks involve, respectively, the investigation of 
syntagmatic and paradigmatic relations [8]. In an association experiment, a 
paradigmatic response could be seen as a possible substitute for the cue word, as in 
mother/father or orange/fruit, whereas a syntagmatic response would 
represent a precedence or following relation, as in mother/love or orange/eat. 
Paradigmatic relations include the hierarchical, synonymy and meronymy 
relations. Syntagmatic relations include the restriction or modification relations 
and the case-type or argument relations. 

Different researchers working on semantic relations, particularly for acquisi- 
tion and representation of knowledge from machine readable dictionaries have 
used different sets of semantic relations ([1, 3, 16, 17]) to achieve a representation 
of natural language sentences. While there is no agreement upon all the rela- 
tions that make up this set [^, there is general agreement on some syntagmatic 
relations, such as agent, object, instrument, time, location, and on some paradig- 
matic relations, particularly the hierarchical one. In fact, the is-a, hypernym 
or taxonomic relation is often given special attention and is used to create a 
separate structure within a knowledge base (or a domain model) called the type 
hierarchy. 

The type hierarchy is an important structure defining and classifying all 
the knowledge within a domain. For the English language, it should define and 
classify all the words and their word senses. Because of its organizing/grouping 
structure, it is often used as the basis of a semantic distance “evaluator”. 

In this paper, we emphasize the inadequacy of using the type hierarchy alone 
for that specific task of measuring semantic distance. We suggest that other 
relations, particularly of the syntagmatic type, must be involved. 

Collins and Quillian, in their quest toward understanding the importance 
given to different types of relations, talk in terms of the “accessibility” of a 
relation in a person’s memory. 

In many cases, the superset is the most accessible property of a concept, 
though not always (e.g. it is probably not the most accessible property of 
a nose that it is an appendage or a body organ). 

What we propose in this paper is in line with the opinion expressed by the 
quote above. The type hierarchy could be expanded to incorporate syntagmatic 
relations, and thus represent an expanded conceptual space. With that goal 
in mind, the rest of this paper is organized as follows. Section 2 introduces 
covert categories as a way to achieve the proposed expansion and presents some 
results based on an experiment of knowledge acquisition from a child’s first 
dictionary. Section 3 describes the new conceptual space, presents results of 
incorporating covert categories into the type hierarchy, and discusses semantic 
distance measures in this new context. Section 4 compares our work to other work 
on knowledge structuring. Section 5 provides some discussion and conclusions. 

2 Covert Categories 

An analysis of a child’s first dictionary with a different research goal [S], triggered 
the interest for this research on investigating possible extensions to the type 
hierarchy. The first subsection briefly presents the child’s first dictionary and 
compares it to adult dictionaries with the aim of emphasizing other ways to 
define concepts than through their classification in a type hierarchy. The second 
subsection presents the idea of non-lexical concepts, introduced previously as 
covert categories by Cruse in his work on lexical semantics [7]. 

The third subsection proposes ways to extract covert categories from text. 
Finally, the fourth subsection gives some examples of covert categories 
found in text. 



2.1 Beyond the Type Hierarchy 

Earlier work on machine readable dictionaries aimed at building a type hierarchy 
by analyzing noun definitions m- A noun is usually defined by the class it be- 
longs to (genus) and by its particularities when compared to the other specimen 
in that same class (differentia). Theoretically, a large type hierarchy could be 
built by collecting the genus for each defined noun in a dictionary. Unfortunately, 
not all nouns fit so well into this definition template. Sometimes, the analysis 
of a noun definition will result in the discovery of a genus called an “empty 
head” mm. which does not say much as a superclass but is there to express 
a relation between the word defined and another word in the definition. For ex- 
ample, the definition of the word leg as a part of the body, where the word part is 
the genus, is more useful to indicate a relation (part-of) between leg and body. 
The problem with the genus/differentia template is that it obliges lexicographers 
to define a word using another word of the same part of speech. This process 
sometimes renders the resulting definitions more complex than they should be. 
Even more problematic (for our purpose), the genus which is chosen in a more 
or less intuitive way by a lexicographer who often relies on introspection, will be 
incorporated in the type hierarchy. It becomes the superclass node under which 
the defined word node is put. The semantic similarity of the defined word to all 
other words in the dictionary is now set from this chosen genus, and from the 
place its corresponding node will occupy in the type hierarchy. 

In a child’s first dictionary, the American Heritage First Dictionary (AHFD), 
things are explained in a simple way, and the classification of a noun via a genus 
is not always its most important feature. In some definitions, the first sentence 
describing the genus is completely missing; as if the lexicographers went to a more 
essential characteristic of the defined word, like its usage or purpose, something 
that would be more essential for understanding the word. 

A definition not using the genus/differentia template (or using a genus so 
general it is meaningless, e.g. something) puts the noun in relation to other parts 
of speech, often as a case relation to a verb, such as its typical object, agent or 
instrument. Definition 1 shows a few sentences following that pattern. 

Definition 1. 

Food is what people or animals eat. 
A habit is something you do often. 
Hair is what grows on your head. 
A pan is something to cook in. 
Sound is anything that you hear. 

Definition 2 shows how the nouns from Definition 1 are defined in the adult 
version, the American Heritage Dictionary (AHD). For each noun, the sense (given in 
parentheses) most similar to the AHFD definition was chosen. 

Definition 2. 

Food(1): A substance taken in and assimilated by an organism to maintain 
life and growth; nourishment. 
Habit(1): A pattern of behavior acquired by frequent repetition. 
Hair(1): A fine, threadlike outgrowth, especially from the skin of a mammal. 
Pan(1): A wide, shallow, open container for household purposes. 
Sound(1a): A vibratory disturbance, with frequency in the approximate range 
between 20 and 20,000 cycles per second, capable of being heard. 



Definition 2 exemplifies the arbitrariness mentioned earlier. We can see how 
the adult dictionary sometimes provides a genus at the expense of getting into 
complicated sentence structures and resorting to obscure nominalizations (e.g. 
vibratory disturbance). Table 1 shows the different emphasis given by the two 
dictionaries. Looking at the AHD column of Table 1, we see that within the resulting 
type hierarchy, food will be semantically close to all other substances, and hair 
will be semantically close to all other outgrowths. 



Table 1. Comparison of definitions from the AHFD and AHD 

Noun     AHFD: case relation    AHD: superclass 
food     object(eat)            substance 
habit    object(do often)       pattern of behavior 
hair     agent(grows)           outgrowth 
pan      instrument(cook)       container 
sound    object(hear)           vibratory disturbance 



If a noun can be characterized by its relation to a verb in the AHFD, how 
can we extract and encode this information in our knowledge store? We could 
have access to this information either in a structure of its own or, more effectively, 
directly within the type hierarchy, if we expand the type hierarchy with 
non-lexical (unnamed) concepts. We found in [7] that a similar idea had been 
introduced earlier under the name “covert categories”. 

2.2 Sentence Frames Leading to Non-lexical Concepts 

The lexical items in a taxonomic hierarchy may be considered to be labels 
for nodes in a parallel hierarchy of conceptual categories. Now while the 
existence of a label without a corresponding conceptual category must 
be regarded as highly unlikely, it is not impossible for what is intuitively 
recognized as a conceptual category without a label. Sometimes there may 
be clear linguistic evidence for the existence of the unlabeled category. [7] 



Table 2. Definitions with pattern (X carry people/load) 

An airplane is a machine with wings that flies in the air. 
Airplanes carry people from one place to another. 

A balloon is a kind of bag filled with gas. 
Some balloons are huge and can carry people high into the sky. 

A boat carries people and things on the water. 

A bus is a machine. 
Buses carry many people from one place to another. 

A camel is a large animal. 
Camels can carry people and things across the desert. 

A donkey is an animal. 
Donkeys can carry heavy loads. 

A helicopter is a machine. 
It carries people through the air. 

A ship is a big boat. 
Large ships can carry many people across the ocean. 

A subway is a train that travels through tunnels underground. 
Subways carry people through large cities. 

A train is a group of railroad cars. 
Trains carry heavy loads from one place to another. 

A truck is a machine. 
It is a very large car that is used to carry heavy loads. 

A wagon is used to carry people or things from one place to another. 



As linguistic evidence, Cruse suggests a sentence frame for a normal sentence 
containing a variable X where we determine a list of items that X could be. For 
example, given the sentence frame John looked at the X to see what time it was, 
we could generate the list clock, watch, alarm clock as possible values for X. Cruse 
calls these categories with no names, but for whose existence there is definite 
evidence, covert categories. 

Many covert categories can be discovered in the AHFD. Some might in fact 
have a corresponding word, but one that is not yet part of the vocabulary taught 
to children (6 to 8 years old). More generally, a covert category might correspond to a 
word in language A while no word in language B expresses the same 
concept. Consider the vehicle category, which does have a label in English, but 
not in the child’s world of the AHFD. A vehicle in the adult AHD is defined as 
A means to carry people, loads and goods. In the AHFD, the concept of vehicle 
is expressed by an agentive case role to the action of carrying. The sentence 
frame could be X carries/carry people/loads. Table 2 shows definitions matching 
that sentence frame. A class including airplane, balloon, boat, bus, camel, donkey, 
helicopter, ship, subway, train, truck, wagon is formed, and all these lexical units 
have in common that they can replace X in the frame X carries/carry people/loads. 
Some members of the cluster (such as airplane, bus, helicopter and truck) are 
already assigned a genus (machine), but others are not. 

2.3 Discovering Covert Categories 

Covert categories are often case roles to verbs. In this paper, we explore this 
avenue, and in future work we will look into covert categories related to nouns. 

When several nouns (that is, more than a set threshold) are found in a text 
(say a corpus, or a machine readable dictionary) playing the same case role to 
the same verb, we can include that covert category in our knowledge store. 
The choice of the text will have an influence on the extracted covert categories. 
Semi-technical texts aim for a simple and clear understanding of a new subject. 
Dictionaries will give facts to a reader. Both are good sources for the extraction 
of knowledge. More poetic text, or text containing multiple metaphors might 
not be appropriate. Actually, exploring covert categories with metaphors would 
be quite interesting but beyond the scope of this paper. 

Using the AHFD, we took each verb from the dictionary and built a list of 
all possible immediate relations for each of them, which includes agent, object, 
location, instrument, time. 

Using the Conceptual Graph formalism m, we build a general graph G1 
around a verb with a relation r1 that will vary over all possible relations, and 
a variable X of type everything, meaning it can subsume all concepts in the 
original type hierarchy. 

For example with the verb carry, we have: 

G1 = [carry]->(r1)->[X] 

Using graph matching techniques, we project G1 onto all the graphs 
constructed from the dictionary definitions and included in the knowledge store. 
As a result, we have a list of all projections, that is, all subgraphs that are 
more specific than G1. When the number of occurrences of a projection exceeds 
a certain threshold (let us call it the Covert Threshold, CT), we define a covert 
category. 

We can assign a label to each covert category found. The label is arbitrary; it 
is the concept definition associated with the label that is important. In practice, 
the name of the label is chosen as a mnemonic for the concept. For example, 
the label carry~agent is used for the concept definition [carry]->(agent)->[X]. When 
a single relation is involved in the concept definition, we talk of a level 1 covert 
category. 
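
The level 1 discovery step can be illustrated with a minimal sketch. It does not use conceptual graph projection; instead it assumes that the definitions have already been reduced to hypothetical (verb, relation, noun) triples, and it assumes that the threshold CT is compared against the number of distinct nouns filling X. The triples, the function name and the threshold value are illustrative only.

```python
from collections import defaultdict

# Hypothetical pre-parsed (verb, relation, noun) triples standing in for the
# conceptual graphs of a few AHFD definitions (assumption, not the real store).
TRIPLES = [
    ("carry", "agent", "airplane"),
    ("carry", "agent", "bus"),
    ("carry", "agent", "camel"),
    ("carry", "agent", "donkey"),
    ("carry", "object", "person"),
    ("carry", "object", "load"),
    ("cook", "instrument", "pan"),
]


def level1_covert_categories(triples, ct):
    """Group the fillers of X by (verb, relation) and keep the groups whose
    number of distinct fillers exceeds the covert threshold ct."""
    fillers = defaultdict(set)
    for verb, relation, noun in triples:
        fillers[(verb, relation)].add(noun)
    return {
        f"{verb}~{relation}": sorted(nouns)
        for (verb, relation), nouns in fillers.items()
        if len(nouns) > ct
    }


categories = level1_covert_categories(TRIPLES, ct=2)
print(categories)
```

With this toy input only carry~agent passes the threshold; carry~object (two fillers) and cook~instrument (one filler) do not.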
If there are multiple projections of G1 with the same concept replacing X, 
we refine our graph and try the projection again. For example, if the concept 
person occurs often with r1 specialized to object, we create a graph G2 containing 
a relation r2 that can be specialized again to all possible relations. 

G2 = [X]<-(r2)<-[carry]->(object)->[person]. 

Again, using graph matching techniques, we project G2 onto all the graphs 
in the knowledge store. As a result we have a list of all projections, and if the 
number of elements in that list exceeds the Covert Threshold, we define a covert 
category. At this point, as two relations are involved, we talk of a level 2 covert 
category. Any level 2 category should be subsumed by (be more specific than) a 
level 1 category. 
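
The refinement to level 2 can be sketched in the same simplified setting. Here each sentence is assumed to be reduced to a hypothetical record holding one verb and its case roles; one role is fixed (object = person) and the remaining relation r2 is left free. The records, the function name and the label format (modelled on labels such as play~agent~person~object) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical simplified graphs: one verb plus its case roles per sentence.
GRAPHS = [
    {"verb": "carry", "roles": {"agent": "airplane", "object": "person"}},
    {"verb": "carry", "roles": {"agent": "bus", "object": "person"}},
    {"verb": "carry", "roles": {"agent": "subway", "object": "person"}},
    {"verb": "carry", "roles": {"agent": "train", "object": "load"}},
    {"verb": "carry", "roles": {"agent": "donkey", "object": "load"}},
]


def level2_covert_categories(graphs, verb, fixed_rel, fixed_concept, ct):
    """Match [X]<-(r2)<-[verb]->(fixed_rel)->[fixed_concept] against the
    graphs, letting r2 range over every other relation that is present."""
    fillers = defaultdict(set)
    for graph in graphs:
        roles = graph["roles"]
        if graph["verb"] != verb or roles.get(fixed_rel) != fixed_concept:
            continue
        for r2, concept in roles.items():
            if r2 != fixed_rel:
                fillers[r2].add(concept)
    return {
        f"{verb}~{fixed_rel}~{fixed_concept}~{r2}": sorted(concepts)
        for r2, concepts in fillers.items()
        if len(concepts) > ct
    }


vehicles = level2_covert_categories(GRAPHS, "carry", "object", "person", ct=2)
print(vehicles)
```

Every noun returned here also fills X in the level 1 category carry~agent, which reflects the subsumption of level 2 categories by level 1 categories.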



2.4 Examples of Extracted Covert Categories 

Result 1 and Result 2 show, respectively, some level 1 and level 2 covert categories 
found through the analysis of the AHFD. For each category, we first give the arbitrary label and 
the conceptual graph definition of the covert category. Then we list each concept in the 
knowledge store (word or word sense) from the list of projections found, followed 
by its number of occurrences in the knowledge store. 

Result 1. 

blow~agent: [blow] -> (agent) -> [X]. 
    X = wind 3 
        person 5 
        horn_3 1 
        instrument 1 

clean~with: [clean] -> (with) -> [X]. 
    X = brush 1 
        soap 1 

sew~object: [sew] -> (object) -> [X]. 
    X = button 1 
        cloth 1 
        patch_1 8 
        pocket 1 
        dress 2 
        clothes 3 



Result 2. 

play~agent~person~object: 
    [play] - 
    { 
        (agent) -> [person]; 
        (object) -> [X]; 
    }. 
    X = music 2 
        baseball 1 
        game 5 
        outside 2 
        secret 2 
        instrument 1 
        sport 2 
        movie 1 
        time_3 1 
        song 1 
        well_2 1 
        violin 1 

rise~agent~sun~in: 
    [rise] - 
    { 
        (agent) -> [sun]; 
        (in) -> [X]; 
    }. 
    X = east 1 
        morning 1 

send~agent~person~object: 
    [send] - 
    { 
        (agent) -> [person]; 
        (object) -> [X]; 
    }. 
    X = card_2 1 
        message 1 
        person 1 
        package 2 
        letter 1 



3 Conceptual Space 

Now that we have defined covert categories and proposed a way of finding them 
in text, we look at including them in the type hierarchy to build a conceptual 
space. The first subsection presents results of this expansion, and the second 
subsection discusses the value of semantic distance measures in this new 
conceptual space. 



3.1 Modifying the Type Hierarchy 

Let us look again at the vehicle covert category as defined earlier. This level 
2 covert category is shown in Figure 1, along with another level 2 category, 
both subsumed by a level 1 category, a more general carrier concept. We see 
that some concepts already had a superclass (tractor/machine, bicycle/machine, 
camel/animal) and remain under that superclass, but they also become subclasses 
of the covert category vehicle. 

^ Our discovery process was based on graph matching because of the representation of 
all the definitions from the AHFD in our knowledge store using conceptual graphs. 
Other processes could be proposed, and at this point we are interested in the results 
independently of how the covert categories would have been found. 

(Figure not reproduced; the only recoverable node label is [carry]->(agent)->[X].) 

Fig. 1. Part of the noun hierarchy including vehicle. 



The level 1 categories will usually subsume one or more level 2 categories, 
as they are more general. The depth of the tree (for the covert categories only) 
is now regularized as one more level is added for each new relation which gets 
specified. 

The introduction of a new node in the type hierarchy still has some arbitrariness 
to it, not from the definition of the node, but from the Covert Threshold 
(CT) used to decide whether or not to include the node. Therefore, the CT must be 
chosen experimentally, and will depend on the text used for the extraction 
(acquisition) of covert categories. 

We present two more examples of how the type hierarchy gets modified by 
the addition of covert categories. 

Figure 2 presents the addition of the level 1 covert category live~in and of 
multiple level 2 categories that become subclasses of live~in, as they are different 
specializations depending on the agent relation. In parentheses after a leaf node 
we display its superclass as given by its genus in the AHFD. The superclasses 
area and place already grouped many of the concepts together, but other 
concepts such as water, castle, ground were not grouped together, and the covert 
category live~in brings them all together. 




Fig. 2. Example of covert categories involving live~in in the hierarchy. 

Figure 3 shows the addition in the conceptual space of the covert category 
defined as [person]<-(agent)<-[wear]->(object)->[X]. It is interesting to note that 
a large part of its elements are also subclasses of clothes. But the two classes 
are not equivalent: there are also boots, hats and other things that a person 
can wear that are not considered clothes. The left part of Figure 3 
shows the hierarchy as extracted from the is-a relations. We note that the 
similarity between ring and skate, or ski and clothes, or coat and suit could not be 
established. Without the addition of the covert category, the smallest superclass 
common to each pair is the very general concept something, under which all 
nouns are placed. Adding the covert category wear~agent~person~object establishes 
a similarity between all things that a person can wear. 

Fig. 3. Covert category wear~agent~person~object part of the hierarchy. 

3.2 Semantic Distance 

To decide how close two concepts C1 and C2 are, semantic distance measures 
have been previously defined on the type hierarchy. One such distance is defined 
based on the path length going from C1 to C (the most immediate superclass 
subsuming both C1 and C2) to C2 [12]. Another distance is based on the 
informativeness of C m- The informativeness is related to the frequency of occurrences 
of the word C in a text (relevant to the application domain). 

The path length measure has been criticized because of the arbitrariness in 
the depth of the type hierarchy on which it is based. The same measure could 
be used in the conceptual space. In fact, if the conceptual space was only made 
of covert categories, the depth would be regularized and the path length would 
be a reasonable measure of distance. 

On the other hand, the covert categories completely prevent the use of 
informativeness. Informativeness is based on the notion that words do exist and are 
present in text in a proportion related to their “generality” or to the amount 
of information that they give. An unnamed or non-lexical concept cannot be 
counted in a text. 

The purpose of this research is to propose ways of expanding the type hierar- 
chy. We only make a brief comment here about semantic distance. Much work in 
this area is needed, and in future work, we can explore different ways of looking 
at the conceptual space with that goal in mind. 

4 Related Work 

We now compare our work to some recent research efforts in similar directions 
involving knowledge extraction, structuring and comparison. This review is not 
meant at all to be exhaustive, but mostly to reflect efforts with different goals, 
but with a somewhat similar underlying intent. 

First, we look at the work of Faure and Nedellec, who use a similar idea of 
establishing groups of words sharing similar syntactic frames to particular verbs. 
They aim at identifying subcategorization frames for verbs. Second, we look at 
the work of Federici (and coauthors), also trying to extract groups of words 
sharing a case role to a verb, and use the results in word sense disambiguation. 
Third, we look at the work of Richardson, and the large MindNet project at 
Microsoft, which aims at constructing a very large knowledge base, and then 
being able to find similar concepts within it. 

4.1 Acquisition of Semantic Knowledge 

In [10], particular syntactic frames are identified as triplets (Verb, prep, Noun) and 
extracted from a corpus. Then groups (clusters) are formed by putting together all 
triplets sharing the first two arguments (V1, p1, ?). Their experiment is performed 
on a French corpus, and results such as (voyager, en, ?) would be found, giving all 
the things that someone can travel in/by (in this particular context, en would be 
translated into English as either in or by). From our perspective, they are 
finding particular level 1 covert categories corresponding to concept definitions 
[verb]->(preposition)->[X]. The relations are not semantic relations but surface 
level prepositions. 

An interesting part of their work is their merging of the resulting groups into 
larger and larger groups, forming a cluster hierarchy (instead of a hierarchy 
of covert categories based on their level of definition). They define a cluster 
distance measure, based on the number of shared nouns, to establish whether 
two clusters should be merged to form a single group. For example, the groups 
corresponding to the frames (voyager, en, ?) and (se déplacer, en, ?) are very similar 
according to their cluster distance measure, and they are merged into a larger group. 
Induction follows (and must be verified by a human expert): the nouns 
from one merged subgroup are proposed for the frame corresponding to the other 
merged subgroup. 
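
To make the merging step concrete, here is a minimal sketch of a distance based on shared nouns. The exact measure used in [10] is not given here, so the sketch substitutes a Jaccard distance as a stand-in assumption; the French nouns, the merge threshold and the function names are hypothetical.

```python
# Hypothetical noun clusters for two (verb, preposition, ?) frames.
CLUSTERS = {
    ("voyager", "en"): {"voiture", "train", "avion"},
    ("se déplacer", "en"): {"voiture", "train", "bateau"},
}


def cluster_distance(nouns_a, nouns_b):
    """Jaccard distance: 1 minus the share of nouns common to both clusters.
    This is an assumed stand-in for the measure described in the text."""
    union = nouns_a | nouns_b
    if not union:
        return 0.0
    return 1.0 - len(nouns_a & nouns_b) / len(union)


def merge_if_close(clusters, frame_a, frame_b, max_distance):
    """Return the merged noun set when the two clusters are close enough,
    otherwise None. A human expert would still validate the induced uses."""
    nouns_a, nouns_b = clusters[frame_a], clusters[frame_b]
    if cluster_distance(nouns_a, nouns_b) <= max_distance:
        return nouns_a | nouns_b
    return None


distance = cluster_distance(*CLUSTERS.values())
merged = merge_if_close(CLUSTERS, ("voyager", "en"), ("se déplacer", "en"), 0.5)
print(distance, sorted(merged))
```

In this toy case the merged group licenses the induced candidates (voyager, en, bateau) and (se déplacer, en, avion), which is the kind of induction that the text says must be checked by an expert.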

The manipulation of resulting clusters is an area which we will investigate 
in the near future. It actually involves investigating the type hierarchy of verbs, 
to see if covert categories established around a certain verb could be merged with 
covert categories established around a more general verb. Faure and Nedellec do 
not investigate the verb hierarchy but mostly the nouns found in the clusters. 

4.2 Semantic Similarity and Word Sense Disambiguation 

The work of Federici et al. is quite interesting, and is part of a large effort toward 
EuroWordNet. Multiple teams in the European Union are working toward build- 
ing WordNet m similar lexical knowledge bases in different languages. WordNet 
is probably the most used source of lexical knowledge at this time for NLP 
applications. WordNet is a large semantic network of word senses. It contains a 
few relations, more than only the hypernym (is-a), but still not the syntagmatic 
relations that we are trying to capture with covert categories. 

Thesaural relationships such as hyperonymy and synonymy, however, do 
not always capture the dimension of similarity relevant to the context in 
question. m 

We couldn’t agree more. Federici et al. are extracting groups of nouns playing 
the subject, or object role to a particular verb. In the extracted frames, the 
correct noun senses and verb senses are assigned to the verbs and nouns. The 
groups of noun senses being related to a same verb sense give some measure of 
their similarity, and the groups of verb senses related to the same noun senses 
also give some measure of similarity. When a test sentence is processed, and it 
contains nouns and verbs to be disambiguated, the previous groups found are 
used as evidence for some word sense disambiguation heuristics. 

^ We actually view deeper semantic relations and prepositions as all part of a relation 
hierarchy [2]. Both could be present in our conceptual graph representations of 
definitions. 

Again here, level 1 covert categories are being extracted, more specifically 
around the object/subject relations. The interesting aspect here, that we will 
look into, is the clustering of the verbs with relationship to the nouns. 



4.3 Path Length in Semantic Networks 

Covert categories can also be related to the idea of establishing the similarity 
between items by finding the shortest path between them in a semantic network 
mg. Let us imagine that all our CGs for all the definitions become inter- 
connected into a large network. Two hyponyms of eat~object, such as fruit and 
potato would be at distance 2 (or some proportionally related distance) as the 
path between them is: [fruit]<-(object)<-[eat]->(object)->[potato]. 

The problem is that fork would also be at distance 2 from [fruit] because 
of [fruit]<-(object)<-[eat]->(with)->[fork]. Richardson [19] does an interesting 
investigation into finding particular paths which are indicators of similarity (he 
developed as well a measure of the weight of different paths). The thesaurus is 
used to establish in advance similar words and then the best possible paths are 
found between them within MindNet, a large semantic network. 

Among the 20 most frequent paths are: Hyp-HypOf (apple-fruit-orange), 
Tobj-TobjOf-Hyp (apple-eat-orange-fruit), Hyp-Hyp (castle-building-place). The 
covert categories at level 1 correspond to symmetric paths such as Tobj-TobjOf, 
or Manner-MannerOf. One advantage of the path similarity measure is that a 
path of any length can be considered, but its disadvantage is that a path does 
not branch out and therefore cannot find covert categories at level 2. For ex- 
ample, we could find a covert category carry~instrument at level 1, which has 
for hyponyms arm, bucket, pipe, car, tube, wagon and which could be found 
by a path Inst-InstOf. Now, we refine to level 2 and find the covert category 
carry~object~liquid~instrument. There is more similarity between bucket, pipe 
and tube, which cannot be expressed by a path length. 
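
The fruit, potato and fork example can be reproduced with a short sketch. It assumes a tiny undirected network built from three hypothetical (verb, relation, noun) edges and uses breadth-first search to recover the relation labels along a shortest path; MindNet itself and its path weights are not modelled, and all names are illustrative.

```python
from collections import deque

# Edges of a tiny network taken from the example: (verb, relation, noun).
EDGES = [
    ("eat", "object", "fruit"),
    ("eat", "object", "potato"),
    ("eat", "with", "fork"),
]


def build_network(edges):
    """Store each edge in both directions so paths can cross a verb node."""
    network = {}
    for verb, relation, noun in edges:
        network.setdefault(verb, []).append((relation, noun))
        network.setdefault(noun, []).append((relation, verb))
    return network


def shortest_relation_path(network, start, goal):
    """Breadth-first search returning the relation labels along a shortest
    path from start to goal, or None when no path exists."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, labels = queue.popleft()
        if node == goal:
            return labels
        for relation, neighbour in network.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, labels + [relation]))
    return None


def is_symmetric(labels):
    """A symmetric path uses the same relation going up and coming down."""
    return labels == labels[::-1]


network = build_network(EDGES)
to_potato = shortest_relation_path(network, "fruit", "potato")
to_fork = shortest_relation_path(network, "fruit", "fork")
print(to_potato, is_symmetric(to_potato))
print(to_fork, is_symmetric(to_fork))
```

Both paths have length 2, but only the fruit to potato path is symmetric, which is the property that corresponds to a level 1 covert category (here eat~object).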

5 Discussion 

This work gives some insight into the relative importance of the taxonomic 
relation and other syntagmatic relations for a particular task of semantic distance 
evaluation. Our work is not isolated, but in the same spirit as work from different 
groups, having different goals but all contributing to an effort of generalizing the 
similarity space and of using corpus evidence to find other groups of words than 
the ones established by the type hierarchy. 

Covert categories, that is, concepts with no corresponding word, or non- 
lexical concepts, define groups of semantically related words. 

We presented covert categories at level 1, which correspond to triplets or 
symmetric paths (concept-relationX-concept-relationX-concept) in other works. 
A singular contribution of our work is to also propose level 2 covert categories, 
which cannot be expressed by paths, and which express multiple case roles to 
a verb. Level 1 and level 2 covert categories expand the type hierarchy into a 
conceptual space. We can also imagine expansion to level 3. The covert categories 
also introduce the notion of semantic distance as independent of the existing 
lexical items in a language. 

Within the AHFD, if we are to use our type hierarchy (built from the ex- 
tracted genera) to establish the similarity between pen and crayon, we find that 
one is a subclass of tool and the other of wax, both then are subsumed by the 
general concept something. These two concepts on the other hand have the sim- 
ilarity of both being instruments of writing. By sharing this information, a pen 
is more similar to a crayon than it is to a pumpkin. The pumpkin is probably 
more similar to a carrot than it is to a broccoli. In the future, we shall also 
explore relations to nouns, in addition to case roles to verbs, as shown by this 
last comparison. 
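
The pen and crayon case can be sketched as a path length computation over a hierarchy that allows several superclasses per node. The two genus links come from the text above; the label write~instrument and its placement directly under something are assumptions made for illustration, as are the function names.

```python
from collections import deque

# Genus-based hierarchy from the text: child -> set of direct superclasses.
TYPE_HIERARCHY = {
    "pen": {"tool"},
    "crayon": {"wax"},
    "tool": {"something"},
    "wax": {"something"},
}


def ancestor_distances(parents, node):
    """Map node and each of its superclasses to the minimal number of
    is-a steps needed to reach it from node."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def path_length(parents, c1, c2):
    """Shortest path from c1 up to a common superclass and down to c2.
    Assumes the two concepts share at least one superclass."""
    d1 = ancestor_distances(parents, c1)
    d2 = ancestor_distances(parents, c2)
    return min(d1[c] + d2[c] for c in d1.keys() & d2.keys())


# Expanded conceptual space: add a hypothetical covert category node.
conceptual_space = {child: set(sup) for child, sup in TYPE_HIERARCHY.items()}
conceptual_space["pen"].add("write~instrument")
conceptual_space["crayon"].add("write~instrument")
conceptual_space["write~instrument"] = {"something"}

before = path_length(TYPE_HIERARCHY, "pen", "crayon")
after = path_length(conceptual_space, "pen", "crayon")
print(before, after)
```

In the genus-only hierarchy the two concepts meet only at something; with the covert category added they share a much closer superclass, so the path length drops.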

Similarity can be defined at many levels, and limiting ourselves to the type 
hierarchy does not seem adequate. By finding covert categories, we add more 
dimensions to the taxonomy, more ways to find similarities between objects. 
We have not, however, proposed a semantic similarity measure, which should take 
into account the multiple links that two words might share (multiple covert 
categories). Future work should focus on how exactly to determine the semantic 
similarity within that large conceptual space. 

Abstract. In the GRAAD project we are developing a knowledge-based system 
able to determine routes in a simulated urban environment and to generate 
natural language descriptions which are not distinguishable from those produced 
by human beings in similar conditions. In this paper, we present a new 
spatial model whose topology is based on the notion of an object’s influence area. 
An influence area is a portion of space that people mentally build around 
spatial objects to take neighborhood into account. We use this notion to formally 
define the properties of neighborhood, orientation and distance in a 
qualitative way. We also introduce the notion of an object's perception area, 
an area gathering all the locations from which an object can be perceived. 
Based on these notions, we describe two modules of the GRAAD System that 
are able to find routes in a simulated urban environment and to generate route 
descriptions in natural language which are analogous to those created by people. 



1. Introduction 



In the GRAAD project we aim at developing a knowledge-based system able to
determine routes in a simulated urban environment and to generate natural language
descriptions that are not distinguishable from those produced by human beings in
similar conditions. Several studies [10] [13] showed that human spatial reasoning is
essentially qualitative [17]. This is particularly true of route descriptions, as was
shown by Gryl [12] in a cognitive study of how French subjects describe routes in
an urban environment. She showed that route descriptions created by people are
composed of two main components: landmarks, which are elements of the considered
environment, and actions, which are the instructions that the pedestrian has to follow.
In addition to verbal expressions corresponding to actions proposed to the pedestrian,
people use several descriptive expressions based on a few fundamental notions
expressed in a qualitative way: spatial expressions containing information about
neighborhood (such as "You are in front of Building B"), distance (such as "You are not far
from the train station") and/or orientation (such as "In front of you is Avenue A"), and
descriptive information based on perception (such as "On your left hand side you will
see the tower T").

However, most existing qualitative spatial models lack a definition of the neigh- 
borhood relation. They generally address all or a part of the eight basic topological 
relations defined by Hernandez [13] and by Randell, Cohn and Cui [21]. These mod- 
els cannot be used to represent neighborhood because their underlying topological 
approach is only based on connectivity relations. Furthermore, these models cannot 
be used to represent perceptual information found in route descriptions. Hence, a new 
approach to spatial modeling is needed. 

Like several researchers such as Maaß [19] and Raubal et al. [22], we think that
understanding human perception of space and considering the cognitive mechanisms 
involved in human spatial reasoning provide useful insights to adequately define 
topological relations. Several cognitive psychologists think that people mentally build 
a subjective “influence area” around objects that they perceive in their environment, 
in order to speak about their relative positions, distances and orientations [4] [3]. 
Starting from this idea, we elaborated a spatial model that makes use of the notion of 
influence area to qualitatively define the relations of neighborhood, orientation and 
distance between spatial objects as well as to simulate perceptual knowledge needed 
for generating route descriptions. This model is presented in Section 2. 

Cognitive psychologists have also shown that people use some kind of mental map 
when they deal with space in various tasks such as navigation, scene descriptions and 
spatial reasoning [18] [24] [23]. Since we aimed at developing a system that could 
generate route descriptions similar to those provided by human subjects, we had the 
idea to develop a software tool to manipulate a spatial conceptual map (SCM) which 
captures in a simple way the main notions underlying human mental maps. A SCM is 
an abstraction of a real map representing a portion of the urban environment and is 
composed of landmark objects and Ways, the notions that underlie human route
descriptions [12]. The topological properties of a SCM are based on our spatial model
which makes use of the notion of object’s influence area. We built the GRAAD Sys- 
tem which is able to determine routes in a SCM and to generate route descriptions in 
natural language [14]. 

In this paper, we describe the main characteristics of our model of space [15][16]
and show how it can be used to generate route descriptions taking into account
perceptual information. In Section 2, we introduce the notion of influence area and
present the qualitative definitions of topology, distance and orientation that underlie our
spatial model. In Section 3 we introduce the notion of perception area and show how
it is used in our model. Section 4 describes the fundamental characteristics of two
of GRAAD's modules used to generate routes and their descriptions in natural language.

2. Object's Influence Areas and the Spatial Conceptual Map

Influence areas supposedly allow people to contextually reason, to evaluate metric 
measures, to qualify positions and distances between objects, etc. Hence, influence 
areas allow people to qualitatively reason about space [4] [3]. The influence area is an 
abstraction of the way that objects influence people’s vision and perception of scenes. 
It is related to the salience of objects in their environment. As an illustration of how 
people use influence areas when reasoning about space, let us suppose that you are 
located at about 500 meters from a 30-story tower. You will certainly say that you
are close to the tower. Now, suppose that you are located at about 500 meters from a 
bicycle. You will certainly not say that you are close to the bicycle. We can empiri- 
cally observe that instead of dealing with the same quantitative distance, our reason- 
ing can be influenced by the relative importance of objects and their associated influ- 
ence areas on the space that surrounds them. In this research, we use the notion of 
influence area as a basic notion in order to define a qualitative model of space that 
can be manipulated by a computer. In this section we provide formal definitions of 
neighborhood, distance and orientation. 





Figure 1: Influence areas and the notions of neighborhood and distance



Formally, the influence area $IA$ of an object $O$ is a portion of space surrounding $O$
such that (Figure 1a):

$IA$ has two borders (an interior border and an exterior border);

$IA$'s borders have the same shape as $O$'s border. If from any point $O_i$ located on
$O$'s border $BO$ we draw a perpendicular line (oriented from the center of $O$ to its
exterior, so that we can calculate distances between objects located on this line), this
line crosses $IA$'s interior border at point $IAIB_i$ and $IA$'s exterior border at point $IAEB_i$
such that: $(\forall O_i \in BO)\ (dist(O_i, IAIB_i) = c_1 \text{ and } dist(O_i, IAEB_i) = c_2 \text{ and } c_2 > c_1)$.
The distance $dist(IAIB_i, IAEB_i)$ is called the width of the influence area¹.

We can formulate the qualitative definition of neighborhood as follows (Figure
1b):

Object $O_2$ is a neighbor of object $O_1$ IFF $(O_2 \cap IA(O_1)) \neq \emptyset$.

This notion of neighborhood can only be used to specify that two objects are close
or not. It cannot handle the subtle way that people qualify distances between objects.
Hence, we propose to construct multiple influence areas around each object, where
each $IA$ would represent a certain degree of proximity, that is to say, a certain
qualitative distance to the object. For example, we can define 3 influence areas (Figure 1c)
that simulate the qualitative distances expressed in natural language using the
qualitative expressions very close (vc), close (c) and relatively far (rf).

Now, the qualitative definition of distance is formulated as follows:

Object $O_2$ is at a certain degree of proximity $dp$ of Object $O_1$ IFF

$(O_2 \cap IA_{dp}(O_1)) \neq \emptyset$ and $(O_2 \cap IA_{dp-1}(O_1)) = \emptyset$ with $dp \geq 1$,

where $IA_{dp}(O_1)$ denotes the influence area characterizing the qualitative distance²
$dp$ to Object $O_1$ and $IA_{dp-1}(O_1)$ denotes the next influence area closer to Object $O_1$. We
consider that $IA_0(O_1)$ corresponds to the interior part of Object $O_1$.
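
The definition above can be illustrated with a minimal sketch. It assumes, for simplicity only, that objects are discs and that the influence areas are contiguous rings measured from the border of the reference object; the ring widths, the function names and the disc representation are hypothetical and are not taken from GRAAD. It also assumes that the two objects do not overlap.

```python
from math import hypot

# Assumed outer bounds of the nested influence areas, in meters from O1's border.
# The labels follow the text; the numeric widths are illustrative only.
OUTER_BOUNDS = [("vc", 20.0), ("c", 60.0), ("rf", 150.0)]


def border_gap(o1, o2):
    """Distance between the borders of two discs given as (x, y, radius)."""
    (x1, y1, r1), (x2, y2, r2) = o1, o2
    return max(0.0, hypot(x2 - x1, y2 - y1) - r1 - r2)


def degree_of_proximity(o1, o2):
    """Return the label of the innermost influence area of o1 that o2 intersects.

    Because the areas are contiguous rings, the innermost ring reached by o2 is
    the one containing o2's nearest point, so o2 intersects IA_dp but not
    IA_(dp-1). Returns None when o2 lies outside every influence area of o1.
    """
    gap = border_gap(o1, o2)
    for label, outer in OUTER_BOUNDS:
        if gap <= outer:
            return label
    return None


tower = (0.0, 0.0, 10.0)
print(degree_of_proximity(tower, (25.0, 0.0, 5.0)))   # border gap of 10 m
print(degree_of_proximity(tower, (100.0, 0.0, 5.0)))  # border gap of 85 m
print(degree_of_proximity(tower, (500.0, 0.0, 1.0)))  # border gap of 489 m
```

In a fuller model the ring widths would depend on the salience of the reference object, which is what lets 500 meters count as close to a tower but not to a bicycle.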

In our model, we adopt Hernandez' approach to orientation [13]. We decompose
the plane surrounding any spatial object $O_1$ into a fixed number of orientation areas,
denoted $O_{1,OZ}$, with respect to the intrinsic orientation of the object. For example, the
front left of an object $O_1$ would be denoted by $O_1$ with a front-left subscript. Furthermore,
we think that orientation and neighborhood relations are related and should be integrated
in a unified definition.

Hence, we propose the following definition that takes into account both orientation
and neighborhood relations:

$O_2$ is at a certain degree of proximity $dp$ of $O_1$ viewed from its orientation area $OZ$
IFF $(O_2 \cap IA_{dp}(O_1, OZ)) \neq \emptyset$, where $IA_{dp}(O_1, OZ)$ denotes the intersection of the
influence area $IA_{dp}(O_1)$ with the orientation area $O_{1,OZ}$.
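
As an illustration of the orientation part of this definition, the following sketch classifies a target point into one of eight orientation areas around a reference object. The choice of eight equal sectors, the sector names and the function name are assumptions made for the example; the text only states that the number of orientation areas is fixed.

```python
from math import atan2, degrees

# Eight orientation areas listed counterclockwise, starting at the object's front.
SECTORS = ["front", "front-left", "left", "back-left",
           "back", "back-right", "right", "front-right"]


def orientation_area(obj_xy, heading_deg, target_xy):
    """Name the orientation area of an object (facing heading_deg) containing target."""
    dx = target_xy[0] - obj_xy[0]
    dy = target_xy[1] - obj_xy[1]
    # Bearing of the target relative to the object's intrinsic front, in [0, 360).
    relative = (degrees(atan2(dy, dx)) - heading_deg) % 360.0
    # Each sector spans 45 degrees and is centered on its nominal direction.
    index = int(((relative + 22.5) % 360.0) // 45.0)
    return SECTORS[index]


# A building at the origin whose front faces the positive y axis (heading 90 degrees).
print(orientation_area((0.0, 0.0), 90.0, (0.0, 10.0)))
print(orientation_area((0.0, 0.0), 90.0, (-10.0, 10.0)))
print(orientation_area((0.0, 0.0), 90.0, (10.0, 0.0)))
```

Combining the returned sector with a degree of proximity gives a pair such as (close, front-left), which corresponds to the unified relation defined above.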

Some researchers have introduced concepts that are similar to our notion of influence
area. For example, Hernandez proposed the concept of "acceptance area", which
is based on a model of orientation that he proposed. This model of orientation consists
in creating several areas of intermediate orientations for each object and naming
them according to their degree of proximity or distance with respect to this object.
Hernandez then considers that an object X can be "accepted" as close to an
object Y in its orientation area OR if the position of X allows one to reach directly (in one
transition) orientation area OR of Y. Freksa [6] proposed the model of "conceptual
neighborhood" based on Allen’s temporal intervals [1]. Freksa’s model was also used
by Gooday and Cohn [11], who adapted it to their own spatial model (the RCC
model). Finally, Gahegan [7] proposed the concept of "attractiveness area", which is
conceptually very close to our notion of influence area. Unfortunately, Gahegan only
outlined this concept and remarked, furthermore, that its formalization would be
complex and would require a huge effort for studying and understanding the human
perception of space.

¹ This definition of an object’s influence area is based on Euclidean geometry since we use the
notions of “shape”, “perpendicularity”, etc. The software module which computes areas in
the SCM uses primitive functions based on Euclidean geometry. On top of those primitive
functions, we developed functions that implement our qualitative model of space.

² Parameter dp can take values such as vc ("very close"), c ("close") and rf ("relatively far").




Figure 2a: Influence areas 



Our model goes beyond the previous approaches by providing a formal framework
for qualitatively representing neighborhood, distance and orientation using the
concept of influence area. We started from a cognitive study, done by Gryl [12], of route
descriptions produced by people in an urban environment. The main results of this
study provided us with a confirmation of the qualitative nature of route descriptions 
and a categorization of nominal and verbal expressions used by people. Gryl's study 
[12] led to the determination of two structural components found in route descriptions 
generated by human subjects: local descriptions and paths. A local description corre- 
sponds to a place of the environment where the addressee will have to change her 
orientation, or a place which is worth presenting because it is noteworthy or difficult 
to recognize. Paths correspond to parts of the displacement through which the ad- 
dressee is supposed to move while following the same direction. Paths connect local 
descriptions. Usually, local descriptions contain references to landmark objects and to 
their relative spatial positions with respect to other objects or to the addressee. The 
relative positions of objects are expressed using various kinds of spatial relations such 
as neighborhood relations, topological relations and orientation relations. In these 
natural language descriptions two main elements are found [12]: verbal expressions 
and nominal expressions. Verbal expressions are verbal propositions used to express 
forward moves (such as “to walk straight ahead”; “to walk as far as x”, where x is an 
object of the environment), orientation changes (such as “to turn to your right”) or
localizations (such as “to be in front of y”, where y is an object of the environment). 
Nominal expressions are common or proper names or nominal propositions that are 
used to refer to objects of the urban environment. These results are quite compatible 
with studies of verbal communication for route knowledge [2]. 




Building on the results obtained by Gryl, we define a spatial conceptual map
(SCM) as an abstraction of a real map representing a portion of the urban
environment. It contains representations of landmark objects and medium objects. Medium
objects (we also call them Ways) define areas on which people can move, such as
streets and roads. Landmark objects such as buildings and monuments are used to
help people to identify noticeable elements of the urban environment along the
medium objects defining the route [20]. In addition to landmark and medium objects, a
SCM also contains the influence areas of these objects. Figure 2a shows a portion of
Laval University’s SCM in which we can see landmark objects (the main buildings)
and the Ways. In addition, the GRAAD system displays on that screen the closeness
influence areas of the Ways and landmark objects (the dotted lines surrounding them).
Given a spatial conceptual map S and a Way object Wx, we can define two sets: 

• $CLO(W_x, S)$, which is the set of landmark objects $O_j$ contained in $S$ whose closeness
influence areas, denoted $CL_{O_j}$, have a non-empty intersection with $W_x$:

$CLO(W_x, S) = \{ O_j \mid CL_{O_j} \cap W_x \neq \emptyset \}$;

• $IWO(W_x, S)$, which is the set of Way objects $W_y$ contained in $S$ which have a non-empty
intersection with $W_x$:

$IWO(W_x, S) = \{ W_y \mid W_y \cap W_x \neq \emptyset \}$.
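
The two sets can be computed directly once areas are given an extensional representation. The sketch below assumes that Ways and closeness influence areas are rasterized as sets of grid cells, which is a simplification chosen for the example and not GRAAD's Euclidean representation. All names and data are hypothetical, and the example excludes the Way itself from its own set of intersecting Ways, which the definition above leaves implicit.

```python
def clo(way_cells, closeness_areas):
    """Landmarks whose closeness influence area shares at least one cell with the Way."""
    return {name for name, cells in closeness_areas.items() if cells & way_cells}


def iwo(way_name, ways):
    """Other Ways that share at least one cell with the named Way."""
    target = ways[way_name]
    return {name for name, cells in ways.items()
            if name != way_name and cells & target}


# Three Ways on a grid: W1 and W3 run east-west, W2 runs north-south across W1.
ways = {
    "W1": {(x, 0) for x in range(10)},
    "W2": {(4, y) for y in range(-3, 4)},
    "W3": {(x, 8) for x in range(10)},
}
# Closeness influence areas of two landmarks (illustrative extents).
closeness = {
    "Library": {(x, y) for x in range(1, 3) for y in range(0, 3)},
    "Tower": {(x, y) for x in range(6, 9) for y in range(5, 8)},
}

print(clo(ways["W1"], closeness))
print(iwo("W1", ways))
```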

The first of these two sets logically represents the relation between a given Way 
and its closest landmarks while the second of them logically represents the relations 
between a given Way and its intersecting Ways. Using these two sets and our model 
definitions of neighborhood, distance and orientation, it is possible to logically
partition the portion of $W_x$ contained in $S$ into a set of $n_x$ consecutive segments $W_x[k]$ for
$k = 1, \ldots, n_x$ such that one of the 4 following cases holds:

(c1) $W_x[k]$ is marked by at least one landmark object:
$(\exists O_j \in CLO(W_x, S))\ (CL_{O_j} \cap W_x = W_x[k])$;

(c2) $W_x[k]$ is a crossing of Ways: $(\exists W_y \in IWO(W_x, S))\ (W_y \cap W_x = W_x[k])$;

(c3) $W_x[k]$ is an intersection between a crossing of a Way with $W_x$ and closeness
influence areas of one or several landmark objects:
$(\exists O_j \in CLO(W_x, S))\ (CL_{O_j} \cap W_x = W_x[k])$ and $(\exists W_y \in IWO(W_x, S))\ (W_y \cap W_x = W_x[k])$;

(c4) $W_x[k]$ is a straight unremarkable segment such that:
$(\forall O_j \in CLO(W_x, S))\ (CL_{O_j} \cap W_x[k] = \emptyset)$ and $(\forall W_y \in IWO(W_x, S))\ (W_y \cap W_x[k] = \emptyset)$.

We call a Way Elementary Area (WEA) any segment Wx[k] that is part of a Way 
Wx in the SCM. Figure 2b shows how the Ways of Laval University's SCM have been 
partitioned using cases (c1) to (c4).
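
The partition into WEAs can be sketched in one dimension. The example below assumes that a Way is parameterized by distance along its axis and that each closeness influence area or crossing has already been projected onto the Way as an interval; these assumptions, the function name and the numeric values are hypothetical. It is a simplified reading of cases (c1) to (c4) that labels each piece between consecutive interval endpoints.

```python
def partition_way(length, landmark_spans, crossing_spans):
    """Split a Way [0, length] into consecutive segments labelled c1 to c4.

    landmark_spans: (start, end) intervals covered by landmark closeness areas.
    crossing_spans: (start, end) intervals covered by intersecting Ways.
    """
    cuts = {0.0, float(length)}
    for start, end in landmark_spans + crossing_spans:
        cuts.add(float(max(0.0, start)))
        cuts.add(float(min(float(length), end)))
    points = sorted(cuts)

    segments = []
    for a, b in zip(points, points[1:]):
        mid = (a + b) / 2.0  # any interior point decides the label of the segment
        near_landmark = any(s <= mid <= e for s, e in landmark_spans)
        at_crossing = any(s <= mid <= e for s, e in crossing_spans)
        if near_landmark and at_crossing:
            case = "c3"
        elif at_crossing:
            case = "c2"
        elif near_landmark:
            case = "c1"
        else:
            case = "c4"
        segments.append((a, b, case))
    return segments


# A 100 m Way with one landmark area over [10, 40] and one crossing over [30, 50].
for segment in partition_way(100, [(10, 40)], [(30, 50)]):
    print(segment)
```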

Given a point $A$ of a SCM $S$ located in $WEA_{u1}[m]$ and a point $B$ located in
$WEA_{u2}[n]$, a route $R_{A,B}$ from point $A$ to point $B$ is a succession of adjacent WEAs
that connect $A$ to $B$. The corresponding set of portions of Ways is denoted
$RWP(R_{A,B}, S)$. Hence, a route $R_{A,B}$ is a succession of route segments $R_{A,B}[k]$ for
$k = 1$ to $p$ such that:

$R_{A,B}[1] = WEA_{u1}[m]$; $R_{A,B}[p] = WEA_{u2}[n]$;

for any $k$ such that $1 < k < p$, $R_{A,B}[k]$ is a portion of Way such that:

$(\exists ux)(\exists q)\ (WEA_{ux}[q] \in RWP(R_{A,B}, S) \text{ AND } R_{A,B}[k] = WEA_{ux}[q])$.

Now, each segment of a Way can be identified and logically defined using cases c1
to c4. Hence, Ways are partitioned into WEAs that can be qualified in terms of the 
proximity of landmark objects. This property of WEAs is used to qualitatively de- 
scribe the relevant portions of a route in a route description (See Section 4). 



3. Perception Areas 

In order to use perceptual information in route descriptions, we introduce the notion 
of perception area, an area gathering all locations from which an object can be per- 
ceived. In fact, we can associate several perception areas with an object, each char- 
acterizing the locations from which a part of the object can be perceived. For in- 
stance, the top of a tower can be seen from positions from which the rest of the tower 
cannot be seen. If we can estimate the heights, widths and relative positions of the 
objects contained in a spatial conceptual map, perception areas can be calculated. For 
this purpose, we calculate the perception area of an object O as a visibility circle in 
which subareas are carved: the sub-areas correspond to areas where other objects 
prevent the perception of object O. 
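
The visibility-circle computation can be approximated by a point test. The sketch below assumes a flat two-dimensional scene in which obstacles are discs and the perceived object is reduced to a point, so it ignores the heights mentioned above; the function names and the numeric values are hypothetical. A location belongs to the perception area if it lies inside the visibility circle and no obstacle cuts the sight line.

```python
from math import hypot


def segment_hits_disc(p, q, center, radius):
    """True if the segment p-q passes through the disc (center, radius)."""
    px, py = p
    qx, qy = q
    cx, cy = center
    dx, dy = qx - px, qy - py
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return hypot(px - cx, py - cy) <= radius
    # Parameter of the point of the segment that is closest to the disc center.
    t = ((cx - px) * dx + (cy - py) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    nearest = (px + t * dx, py + t * dy)
    return hypot(nearest[0] - cx, nearest[1] - cy) <= radius


def in_perception_area(viewer, target, visibility_radius, obstacles):
    """True if target is within range of viewer and no obstacle blocks the sight line."""
    if hypot(viewer[0] - target[0], viewer[1] - target[1]) > visibility_radius:
        return False
    return not any(segment_hits_disc(viewer, target, c, r) for c, r in obstacles)


tower = (0.0, 0.0)
obstacles = [((50.0, 0.0), 10.0)]  # one building east of the tower
print(in_perception_area((100.0, 0.0), tower, 200.0, obstacles))  # behind the building
print(in_perception_area((0.0, 100.0), tower, 200.0, obstacles))  # clear sight line
print(in_perception_area((300.0, 0.0), tower, 200.0, obstacles))  # out of range
```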

Figure 3 gives an example of the areas resulting from such calculations. However, 
this method does not provide accurate results in complex environments. Let us re- 
mark that for route descriptions, we are particularly interested in the portions of Ways 
from which a pedestrian can perceive landmark objects. Hence, a more accurate
approach consists in walking along the Ways in the geographical area of interest,
identifying the perception areas of the main landmark objects and reporting them on a map.
Then, the perception areas of objects viewed from the Ways are manually input in the 
SCM as perception partitions of the Ways. We added a tool to the GRAAD system in 
order to draw those perception areas on the SCM (Figure 4). 




Figure 3: Calculated perception areas 



Now, we have two partitions of Ways in a SCM: 1) a partition corresponding to
landmark objects' influence areas that are used to qualify topological relations (Figure
2b) and 2) a partition corresponding to landmark objects' perception areas (Figure 4).
Those two partitions can be merged in order to get a resulting partition in which each
elementary Way segment is characterized by proximity information (thanks to
intersections with landmark objects' influence areas) and perceptual information. The
GRAAD system is based on our spatial model. It operates in a simulated urban
environment in which a character called a “Virtual Pedestrian” (VP) can move. It is able
to generate routes and to describe them in natural language.



4. Using Perceptual Information to Enhance Wayfinding and Route 
Description Algorithms 

Since we have already addressed the issue of wayfinding using our model in another 
paper [15], we will just outline here its main characteristics. Our approach for route 
construction is based on the determination of a path composed of a sequence of Way 
elementary areas (WEAs) which are parts of the Ways of the SCM and are obtained 
after merging the Ways' partitions based on closeness influence areas and on
perception areas. In order to reason about WEAs and displacements, GRAAD generates a
Matrix of Orientation and Adjacency (MOA) which contains relevant information
about angle evaluation and displacement directions that are used by the path
determination algorithm. The columns and lines of the MOA represent the WEAs of the
SCM, and each cell of the matrix MOA(i,j) (where i and j respectively correspond to
the column $WEA_i$ and the line $WEA_j$) contains information about the WEAs' adjacencies
and their relative orientations. Each cell of the MOA is also associated with a list of
the landmark objects that are close to the corresponding WEA (intersection with the
closeness influence area). In addition, each cell of the MOA is associated with a list
of all the landmark objects that are visible from that WEA.




Figure 4: Manual determination of perception areas 



In wayfinding applications, people generally use specific criteria to choose the 
"best candidate" among all possible candidates for their next displacement. Empirical 
evidence shows that, in order to build a route to reach a target object, one possible 
strategy used by a person consists in minimizing the angle between her current dis- 
placement orientation and the estimated orientation of the target object viewed from 
the person’s current position [9]. 

We call that angle "the human subject's vision angle to the target object" (or the 
"vision angle" for short). Using our model, we implemented a wayfinding strategy
based on the minimization of the vision angle. Our wayfinding approach consists in 
systematically minimizing the difference between the vision angles of the current 
position and of the next possible position on a path toward the target position. All 
possible candidate WEAs for the next displacement are evaluated with respect to the 
vision-angle minimization criterion, and the best one is chosen [14][15].
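
One possible reading of this strategy is a greedy search that, at each step, moves to the adjacent WEA whose direction deviates least from the direction of the target. The sketch below makes that reading concrete under several assumptions: each WEA is reduced to a center point, adjacency is given as a plain dictionary instead of the MOA, and visited WEAs are never revisited. The function names and the example map are hypothetical, and this is not presented as GRAAD's actual algorithm.

```python
from math import atan2, pi


def vision_angle(position, next_position, target):
    """Absolute angle (radians) between the move position->next_position
    and the direction position->target."""
    move = atan2(next_position[1] - position[1], next_position[0] - position[0])
    goal = atan2(target[1] - position[1], target[0] - position[0])
    diff = abs(move - goal) % (2 * pi)
    return min(diff, 2 * pi - diff)


def find_route(start, goal, centers, adjacency):
    """Greedy route over WEA identifiers; returns None if it reaches a dead end."""
    route, visited = [start], {start}
    current = start
    while current != goal:
        candidates = [n for n in adjacency[current] if n not in visited]
        if not candidates:
            return None
        # Keep the candidate with the smallest vision angle to the goal.
        current = min(
            candidates,
            key=lambda n: vision_angle(centers[current], centers[n], centers[goal]),
        )
        visited.add(current)
        route.append(current)
    return route


centers = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (2, 0), "E": (2, 1)}
adjacency = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"],
             "D": ["B", "E"], "E": ["D"]}
print(find_route("A", "E", centers, adjacency))
```

A greedy choice of this kind can fail on maps where the most direct heading leads to a dead end, which is why the sketch returns None in that case instead of backtracking.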

When designing the conceptual map, GRAAD asks the user to associate a salience
measure with each landmark object that she draws. This salience measure [8]
characterizes the object's distinctiveness in the landscape and is used when generating the
route description. Hence, objects with a higher salience measure should attract the
pedestrian's attention more easily.




Figure 5: System Interface 



In a previous version [14], the route description algorithm only used information 
about influence areas. In this section we briefly present a new version of the algo- 
rithm that uses perceptual information as well. As an input, the description algorithm 
takes the route generated by GRAAD's route generation module. The main idea of the 
description algorithm consists in detecting direction changes and in describing each 
route portion between two direction changes. For each direction change, the module 
generates a sentence describing the direction change with respect to the pedestrian 
intrinsic reference frame (Ex: "Turn on your left"). The route description between two 
direction changes may contain three parts: one refers to perceptual information, an- 
other part refers to neighborhood information and the last one refers to Ways' cross- 
ings where there is no direction change. The perceptual part consists in determining 
among the list of landmark objects seen from VP's current position (found in the 
matrix MOA) an "attractor object" that characterizes the general direction to be fol- 
lowed until the next direction change on the route. Typically, this will correspond to 
sentences such as "You see building X with characteristic Y. Walk in that direction".
Attractor objects are chosen according to their salience and to their relative angular
distances to the current displacement direction. GRAAD maintains a list of the
attractor objects that have already been mentioned in the route description in order to
describe each of them only once, when it appears in the pedestrian's sight. It is
also possible to mention the disappearance of an attractor object. The neighborhood
part of the description algorithm uses the information about landmark objects'
neighborhood (based on the closeness influence area information in the MOA) as was
done in [14]. It typically generates sentences of the type "You walk besides object X",
"You pass between object X and object Z" or "You arrive in front of object X". When
the route crosses an intersection between the Way on which VP currently walks and
another Way on which there is no change of direction, the description algorithm will
do nothing if it is generating a concise description, or it will mention the Way
crossing and advise the pedestrian to go on walking in the current direction when it is
generating a detailed route description. GRAAD's description generation algorithm can
generate different kinds of route descriptions which vary according to the number of
details given in each of them. Simple criteria are used to choose which details should
be kept (Kettani 1999).
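
The selection of an attractor object can be sketched as a scoring problem. The text states only that salience and angular distance to the displacement direction are both taken into account and that each attractor is described once; the linear score used below, the weight parameter, the function name and the sample data are assumptions introduced for illustration.

```python
def choose_attractor(visible, mentioned, angle_weight=1.0):
    """Pick the next attractor among visible landmarks not yet mentioned.

    visible: list of (name, salience in [0, 1], angular distance in degrees).
    mentioned: set of names already used in the description; updated in place.
    """
    candidates = [v for v in visible if v[0] not in mentioned]
    if not candidates:
        return None
    # Assumed score: salience penalized by the normalized angular distance.
    name, _, _ = max(candidates, key=lambda v: v[1] - angle_weight * v[2] / 180.0)
    mentioned.add(name)
    return name


visible = [("tower", 0.9, 20.0), ("kiosk", 0.3, 5.0), ("library", 0.8, 90.0)]
mentioned = set()
first = choose_attractor(visible, mentioned)
second = choose_attractor(visible, mentioned)
print(f"You see the {first}. Walk in that direction.")
print(second)
```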

Figure 5 presents GRAAD's route generation and description interface. In the 
background there is a conceptual map of Laval University's campus. In the upper 
right corner there is a dialog window that appears when the user selects the route 
generation tool in the tool bar at the top of the main window. In the dialog window 
are displayed the origin and destination locations that the user chooses by clicking on 
relevant landmark objects in the conceptual map. GRAAD's generation algorithm
generates the route, which is displayed on the conceptual map using a succession of
colored segments that correspond to the WEAs composing the route. The virtual
pedestrian's position is marked on the route by a succession of red dots. The route
description can be generated step by step. In Figure 5, we see the description window on the
left hand side of the screen. In the top part of this window the route description algo- 
rithm displays the sentences that are relevant to the current position of VP on the 
conceptual map. If there is a perceptual indication, a picture of the corresponding 
attractor object appears in the bottom part of the window. These pictures are chosen 
in a picture base that is indexed by the positions from which landmark objects can be 
seen. Figure 6 displays the complete text describing the route. 



5. Evaluation and Further Work 

Kettani [14] performed an experiment with two groups of 10 persons in order to 
measure the cognitive adequacy of GRAAD's route descriptions. The persons of the 
first group were given a map of the campus and asked to describe a route between 
two buildings. The same route was described by GRAAD. All the descriptions were 
typed and given to the second group for evaluation. The persons of the second group 
were not able to distinguish GRAAD's description from descriptions created by the 
subjects of the first group [16]. The version of GRAAD that was used for this
experiment did not use the perceptual information that we recently introduced in the
system. We will need to conduct a new set of experiments to test the cognitive ade- 
quacy of the new route descriptions which contain perceptual information. The ex- 
perimental setting will be different. Indeed, in Kettani's experiment subjects were 
given a campus map and asked to describe the route. They were not naturally inclined 
to introduce visual information in their descriptions. In order to validate GRAAD's 
descriptions containing perceptual information, we will ask people to describe the 
route while walking along it: they will be naturally inclined to give visual informa- 
tion. 




Figure 6: A complete description of a route 

Abstract. A natural language interface to a consumer service provider database
is presented. While natural language interfaces to applications are becoming 
more common, they are generally not very robust. We present an approach that 
addresses three specific issues related to robustness. We describe an approach 
for dealing with underspecified queries, for processing queries that require 
some limited form of inference, and for handling queries involving synonyms. 



1 Introduction 

The task of information access and retrieval can quickly become complex for the 
typical user, whether the information is extracted from a highly structured source (like 
a database) or from an inconsistently structured source (like the World Wide Web). 
There are many information management specialists who can deal with the 
complexity of information access and retrieval from vast information sources through 
the use of sophisticated knowledge management systems. However, the growth of the 
Web and of the use of handheld digital devices has resulted in more people desiring 
access to a wide range of information. Moreover, many of these people require 
simpler mechanisms to access this information than those used by information 
management specialists. 

A simple information access mechanism that all users are familiar with is the use 
of natural language. Natural language is the most frequently used medium for humans 
requesting information of humans. Why can't humans use this same medium for 
requesting information from computer systems? Natural language has advantages 
over artificial (human designed) languages typically used in human computer 
interactions in that it is easy to use and it is already known by all users. But, natural 
language is notoriously ambiguous, and natural language interactions typically 
involve information that is not explicitly mentioned in the query. 

The ambiguity and underspecification are not necessarily problematic when 
humans interact with other humans, since the people involved are able to enter into a 
dialogue to disambiguate and/or clarify statements and requests that are uttered. Such 
dialogues represent further challenges for computer systems that use a natural 
language interface to interact with users. Note that dialogues are not limited to the 
purposes of disambiguation or clarification. They can also be used to gather additional 
information in cases where it is not possible or appropriate to give all of the required
information in a single statement or query. Humans might engage in such dialogues in 
the context of tasks like form filling. 

While dialogue is one important approach for resolving ambiguity and 
underspecification, these problems can also be resolved through the use of inference 
in conjunction with a knowledge base. Most natural language interfaces only deal 
with requests for specific information contained in databases or documents. 
Sometimes, though, the user may not phrase a query as a direct request. For example, 
instead of saying "I need a plumber," a user might say "My sink is plugged." The 
system should be able to infer that a plumber is required and respond accordingly. 

Another significant aspect of natural language communication is that the same 
concept can be expressed in a variety of ways. Since human languages provide a great 
deal of freedom with respect to the choice of words and phrases, it is essential that 
the interface provide a similar degree of freedom if it is truly going to be a natural 
language interface. 

In this paper, we introduce a prototype natural language system developed for 
ServisNet Communications, illustrating some approaches for how: 1) dialogue can be 
used to resolve ambiguity and underspecification in natural language queries, 2) a 
limited form of inference can be incorporated in a natural language interface to a 
database, and 3) a domain independent approach to synonyms can be incorporated 
into a natural language querying system. 



2 Background 

The prototype system was designed to act as a natural language front end to allow 
consumers to access a database of service providers for typical household services, 
such as plumbers, electricians, etc. ServisNet's database will provide users access to 
more than 400 services in the areas of residential, commercial, personal, automotive 
and electronic services. While there are a variety of commercial (off the shelf) 
products available that allow the development of natural language interfaces to 
databases, they are limited with respect to the three issues enumerated in the previous 
section. 

The goal was to develop a system that allows the user to type in not only service 
descriptions, like I need a plumber, but also problem descriptions, like I need to fix a 
broken mirror in my bathroom. The system would then process the query, determine 
what type of occupation/service is needed, and output a requisition form for the given 
service provider. 

Of course, support of query variations would also be required. For example, the 
following syntactic variations should all be recognized: 

• I need to fix a broken mirror in my bathroom. 

• I need to get a broken mirror in the bathroom fixed. 

• A broken mirror in my bathroom needs fixing. 

• A broken mirror in the bathroom needs to be fixed. 

• The mirror in my bathroom is broken. 

• I need a broken mirror fixed in my bathroom 







Petr Pp Kubon, Fred Popowich, and Gordon Tisher 



Additionally, lexical variation was also desired. So, the system would need to 
recognize the following queries, and those with similar lexical variations: 

• I need to fix a broken mirror in my bathroom.

• I need to repair a broken mirror in my bathroom.

• I need to mend a broken mirror in my bathroom.

• I need to restore a broken mirror in my bathroom.



3 The Approach 

3.1 Query Analysis 

The prototype works with a set of query templates. When a query is posed to the 
system, an appropriate template is selected from the set and used to retrieve a set of 
keywords from the query. The prototype makes use of a set of about 50 templates, 
which are in the form of regular expressions with minor variations included in 
disjunctions. The templates were determined based on a set of natural language 
queries supplied by ServisNet. The identified keywords are then mapped by the 
inference component to the identification code of an appropriate service provider 
from the database. While a template-keyword approach clearly has limitations in an 
unconstrained domain, we are dealing with a highly constrained domain (specifically, 
a customer service database with a very simple structure). Additionally, the
template-keyword approach is simple [1], well understood, and has been applied
successfully in other highly constrained systems [2].

In the simple, typical case, the keywords retrieved from the query will uniquely 
determine the appropriate service identification code, and no real inference is needed. 
This situation is illustrated in the example in Fig. 1. Here, the ACTION and THING 
keywords identify the problem domain as that of "fixing mirrors." The inference 
component works with an underlying knowledge base, which associates "fixing 
mirrors" with a unique service identification code. The inference component retrieves 
this code and returns it to the user. 

query:         I need to fix a broken mirror in my bathroom

template:      I need to ACTION a TYPE THING in my LOC

keywords:      ACTION=fix, TYPE=broken, THING=mirror, LOC=bathroom

database code: dbc1234 (Glass and Glazing Work and Installation)



Fig. 1. Example query with unique service identification code. 
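
The template-matching step of Section 3.1 can be sketched as follows. This is a minimal illustration, assuming two hypothetical templates written as Python regular expressions; the prototype's roughly 50 templates are not listed in the text, and the names `TEMPLATES` and `extract_keywords` are introduced here only for the example.

```python
import re

# Hypothetical templates (assumptions for illustration, not the prototype's
# actual patterns). Minor variations are expressed as disjunctions.
TEMPLATES = [
    # "I need to ACTION a TYPE THING in my/the LOC"
    re.compile(
        r"^i need to (?P<ACTION>\w+) an? (?P<TYPE>\w+) (?P<THING>\w+) "
        r"in (?:my|the) (?P<LOC>\w+)$"
    ),
    # "The THING in my/the LOC is TYPE"
    re.compile(
        r"^the (?P<THING>\w+) in (?:my|the) (?P<LOC>\w+) is (?P<TYPE>\w+)$"
    ),
]


def extract_keywords(query):
    """Return the typed keywords of the first matching template, or None."""
    normalized = query.strip().rstrip(".").lower()
    for template in TEMPLATES:
        match = template.match(normalized)
        if match:
            return match.groupdict()
    return None
```

Each named group plays the role of one keyword category, so adding a syntactic variation amounts to adding one more pattern to the list.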
3.2 Inference and Dialogue

The second, more complex case arises when the initial result of the query processing 
is uncertain, i.e., when the set of query keywords corresponds to a set of possibly 
appropriate service types. If this happens, the system asks the user to modify the
query to arrive at a unique result and tells the user which parts of the query
need to be modified. This interaction between the user and the
system can happen several times for a single query, until the user supplies enough 
information for the system to arrive at a unique service identification code. Note that 
this problem is a restricted version of a problem encountered by more general natural 
language dialogue systems [3]. 

The example in Fig. 2 illustrates the inference and dialogue capability of the 
system. Here, the THING keyword is missing from the query. The ACTION keyword 
identifies the problem domain as that of "painting something." This information is not 
specific enough to identify a unique service provider. The underlying knowledge base 
of the system reflects this uncertainty by associating "painting something" with three 
possible service codes: 1) dbc2334 for painting rooms, walls, etc; 2) dbc2335 for 
painting roofs; and 3) dbc2336 for painting cars, trucks, etc. The system retrieves the 
set of possible service codes and engages the user in a dialogue. Note that the goal of
this dialogue is not to obtain information (or parameters) from the user to fill in some
blank fields in a form (cf. [3]); instead, it is the simpler goal of determining the unique
code of interest.



query:          I need to do some painting

template:       I need to do some ACTIONing

keywords:       ACTION=paint, TYPE=(optional), THING=(required), LOC=?

system's reply: the system indicates to the user that the query has to specify a
                THING, i.e., the object of the ACTION. It also prints out a list
                of possible objects (room, wall, roof, car, etc.) to help the user
                in modifying the query.

modified query: I need to paint a wall

template:       I need to ACTION a THING

keywords:       ACTION=paint, THING=wall

database code:  dbc2334 (Painting and Paper Hanging)



Fig. 2. Example involving an initial query which is underspecified, which is then modified by 
the user to obtain a unique database code. 
3.3 The Reasoning Method

As described in the previous sections, the system maintains an internal knowledge 
base that associates problem descriptions with appropriate service codes. The 
associations are not directly contained in the knowledge base; rather, they are 
constructed by the inference component of the system making use of rules associated 
with keywords. 

The process works in the following way. The knowledge base consists of an
inheritance hierarchy and a keyword lexicon. The hierarchy establishes relationships
between the different ACTIONs, TYPEs, THINGs, and LOCATIONs used by the
system. These relationships allow information to be shared between related concepts.
The keyword lexicon provides a set of rules that are associated with each keyword.
Each rule (which could also be viewed as a constraint) consists of: 1) a service
identification code, and 2) a set of typed keywords. The rules are interpreted as
follows. If the query contains the word together with the keywords from 2), then the
result of the query is the service identification code from 1). For example, the entry
for the keyword paint has the following structure:



paint {
    dbc2334:
        THINGS wall, room
        OTHERS ANY;
    dbc2335:
        THINGS roof
        OTHERS ANY;
    dbc2336:
        THINGS car, truck
        OTHERS ANY;
}



Here, the first rule specifies that the paint ACTION needs to combine with the wall
or room THING and optionally any other category (TYPE, LOC, etc.) of keyword to
give the code dbc2334 as a result. The second and third rules should be
self-explanatory.

During query processing, the query keywords are found in the knowledge base, and
their respective rule sets are processed. If the processing results in a single code, then
the system is able to produce a result. A query like I need to do some painting 
contains only a single keyword, and results in a set of three possible codes. Since 
there is no unique code, the system then engages the user in the dialogue described in 
the previous section. On the other hand, the query I need to paint a wall combines the 
rules associated with paint with those associated with wall, which allows the system
to uniquely determine the appropriate service provider dbc2334. 
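
The rule lookup and the fallback to a clarification dialogue can be sketched as below. This is a simplified illustration, not the prototype's implementation: it stores only the paint entry shown above, filters its rules by an optional THING keyword rather than combining rule sets from several keywords, and the names `LEXICON` and `resolve` are hypothetical.

```python
# Illustrative lexicon mirroring the `paint` entry above (ACTION -> code -> THINGs).
LEXICON = {
    "paint": {
        "dbc2334": {"wall", "room"},
        "dbc2335": {"roof"},
        "dbc2336": {"car", "truck"},
    },
}


def resolve(action, thing=None):
    """Return ("code", c) for a unique match, else ("clarify", THING options)."""
    rules = LEXICON.get(action, {})
    codes = [code for code, things in rules.items()
             if thing is None or thing in things]
    if len(codes) == 1:
        return ("code", codes[0])
    if not codes:
        # An unknown THING is treated like a missing one: offer every option.
        codes = list(rules)
    options = sorted({t for code in codes for t in rules[code]})
    return ("clarify", options)
```

With this sketch, `resolve("paint")` asks for a THING, while `resolve("paint", "wall")` yields the single code dbc2334, matching the two queries discussed above.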



3.4 Treatment of Synonyms 

The system makes use of WordNet [4] to replace unknown words with synonyms. For 
example, only one of the verbs (fix, repair, mend, restore, ...) needs to be explicitly 
stored in the vocabulary to allow the system to accept the following queries: 

• I need to fix a broken mirror in my bathroom. 

• I need to repair a broken mirror in my bathroom. 

• I need to mend a broken mirror in my bathroom. 

• I need to restore a broken mirror in my bathroom.

Thus, (lexical) semantic variations on many queries can be handled easily. Other 
natural language interfaces, such as START [5], make good use of synonym sets in 
order to increase the robustness of the system. However, the use of an external lexical 
resource such as WordNet can greatly simplify the design and maintenance of the 
lexical database used by a natural language processing system. 
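
The idea of mapping an unknown word onto a stored synonym can be sketched as follows. To keep the example self-contained, a small hand-written synonym set stands in for WordNet; the table contents and the names `SYNSETS`, `VOCABULARY` and `normalize` are assumptions made for this illustration only.

```python
# Stand-in for WordNet synonym sets (hand-written assumption for the example).
SYNSETS = [
    {"fix", "repair", "mend", "restore"},
]

# Words stored explicitly in the system's vocabulary (hypothetical).
VOCABULARY = {"fix", "paint"}


def normalize(word):
    """Map a word to a vocabulary word via its synonym set, or return None."""
    if word in VOCABULARY:
        return word
    for synset in SYNSETS:
        if word in synset:
            known = synset & VOCABULARY
            if known:
                return min(known)  # deterministic choice if several are known
    return None
```

Only one member of each synonym set needs to appear in `VOCABULARY`; the other members are resolved through the external resource.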



4 Limitations and Future Work 

The prototype system works with a small knowledge base: the base vocabulary 
consists of only about 100 words (not including synonyms) dealing with a dozen
services. A major challenge in creating the knowledge base required for a complete 
system lies in specifying the rules, associated with individual keywords, which are 
used to determine the set of services appropriate for a given query. To derive these 
rules, developers need a case base of user queries and corresponding services, which 
would be maintained and extended during the use of the system. The vocabulary 
format of the prototype was designed to allow for (semi)automatic extraction of 
keyword rules from the case base and for an easy update of the knowledge base with 
the extracted rules. An interesting research problem in this area is finding out to what 
extent the task of maintaining the knowledge base and its overall consistency in 
terms of the rules can be done automatically, as well as designing appropriate 
algorithmic support for this task. While the START system [5] does automatically 
construct its knowledge base from sample text, it does not deal with the issue of 
determining how appropriate the stored data is for various queries. 

The efficiency of associating a combination of keywords with a specific service 
depends on the organization of the knowledge base. One promising way of reducing 
the search space in the full version of the system lies in dividing the knowledge base 
into parts, with each part containing only (key)words of one category (ACTION, 
THING, etc.). Within each part, the words would be organized in a hierarchy. Each 
keyword category would thus have an inheritance tree, and a service would be 
associated (via constraints) with a combination of nodes across the trees. The higher 
up in the tree(s) the search would stop by successfully determining a service provider, 
the more effort would be saved in the reasoning component by not having to deal 
explicitly with the nodes below. The main challenge related to the use of this method 
in an extended system lies in tailoring the type hierarchy for the purposes of the task
at hand, i.e., efficiently selecting a service provider. Existing systems that make
use of type hierarchies typically use a more general classification, as exemplified by
WordNet [4]. Finding an appropriate "service type" hierarchy and designing ways for
its (semi)automatic maintenance both present interesting research problems.

The query processing component of the system can be extended along several 
dimensions. First, the method of analyzing a query with respect to the set of possible 
templates could be replaced by a more general method of deriving the internal 
representation of the query (in terms of keywords or some other suitable 
representation) by a parser. Before moving in this direction, some system test results 
involving actual end users should be examined to see the kind of queries and 
constructions encountered. There are plans for a detailed evaluation of the prototype 
once the ServisNet database and consumer service system are available later this year. 
Nevertheless, the use of a full or partial parser should allow for a more sophisticated 
syntactic and semantic processing of the query, and this in turn would allow for 
placing additional restrictions on the search space by imposing some linguistic 
constraints (for example, selectional restrictions) on the query. 

Additionally, an appropriate method for dealing with unknown words should be 
added in the full version of the system. The prototype makes a significant step
towards this goal by demonstrating that the approach adopted by the system seems
naturally suited to this task: to all intents and purposes, an unknown word can be
treated as a missing word (of an appropriate category). The solution devised for
dealing with incomplete queries thus naturally extends to complete queries containing 
unknown words within the data set covered by the prototype. What remains to be seen 
is whether this approach can be applied to more complete data sets, and whether end 
users find this treatment appropriate. 

Finally, the prototype in its present incarnation concentrates solely on specific
problems (broken mirror, plugged drain, etc.), while there are other (although
menu-based) systems on the market (e.g., ImproveNet, http://www.improvenet.com/)
that start with the big picture (bathroom remodeling) and then progressively narrow it
down. This allows them to specify several different bathroom-related services in one 
interaction with the user, while our current prototype would have to go through 
several specific queries to get the same result. On the other hand, our system engages 
the user in a dialogue in case of queries with incomplete information. If the dialogue 
capabilities of the system were extended to support more general cases of gathering 
information from the user (cf. [6], [3]), both general and specific problems could be 
handled equally well. 
Abstract. While classical approaches deal with prototype selection (PS)
using accuracy maximization, we investigate PS in this paper as an infor- 
mation preserving problem. We use information theory to build a statis- 
tical criterion from the nearest-neighbor topology. This statistical frame- 
work is used in a backward prototype selection algorithm (PSRCG). It 
consists in identifying and eliminating uninformative instances, and then 
reducing the global uncertainty of the learning set. We draw from exper- 
imental results and rigorous comparisons two main conclusions: (i) our 
approach provides a good compromise solution based on the requirement 
to keep a small number of prototypes, while not compromising the clas- 
sification accuracy; (ii) our PSRCG algorithm seems to be robust in the 
presence of noise. Performances on several benchmarks tend to show the 
relevance and the effectiveness of our method in comparison with the 
classic PS algorithms based on the accuracy. 



1 Introduction 

Thanks to new data acquisition technologies such as the World Wide Web, modern
data sets (DS) are huge and informative, but contain a varying amount of noise.
In the Machine Learning and Data Mining fields, computer scientists are therefore
led to develop theoretical methods for extracting useful information
(data distillery), in order to build significant and relevant predictive models
while reducing computational and storage costs. This useful information can
concern not only the features characterizing a given problem (feature selection),
but also the collected instances (prototype selection). For the former, we search
for the most discriminant explanatory variables. For the latter, we try to extract
typical and relevant examples of the different classes represented.

We focus in this paper only on prototype selection (PS). Technology for
analyzing data has been developed over the last decades and covers a
broad spectrum of techniques, algorithms, statistics, etc. PS raises the problem
of defining relevance for a prototype subset, and of identifying relevant or
irrelevant instances. An irrelevant instance, which a PS algorithm should be able
to remove, can belong to one of four main categories. The first category covers
instances in regions of the feature space with very few elements. Even if most of
these few points belong to the same class, their vote is statistically a poor
estimator, and a little noise might dramatically affect their contribution. It is
also common in statistical analyses (regression, parametric estimation, etc.) to
search for and remove such points. The second category covers instances in regions
where votes can be regarded as random: local densities are evenly distributed with
respect to the overall class distributions, so such regions carry no information.
The third category covers instances in the most concentrated areas between the
border and the center of each class. Unlike the second category, they are not
irrelevant because of a risk they bring during generalization; they are irrelevant
in the sense that removing them would not lead to the misclassification of an
unseen instance. The last category concerns mislabeled instances, which can result
from variable measurement techniques, typing errors, etc. A suitable PS algorithm
should therefore be able to remove these four categories of instances efficiently.

In [22], the authors present a framework for grouping the PS algorithms
proposed in the literature according to several criteria:

— instance representation: does the PS algorithm retain a subset of the original
instances [DSIEKSUITIEI, or modify the instances using a new representation
[a, [IS]?

— type of selected instances: border mm or central points [ 231 ?

— direction of search: incremental, decremental or batch?

Note that the parameter optimized during the selection, which is almost
always the accuracy in a PS algorithm, is not proposed as a grouping criterion.
Beyond the category to which it belongs, one has to compare the relative strengths
and weaknesses of a new PS algorithm according to other criteria [22]: storage
reduction, generalization accuracy, noise tolerance, learning speed, etc. A "good"
PS algorithm must then provide a good balance between these criteria.

Historically, PS was first aimed at improving the efficiency of the
Nearest Neighbor (NN) classifier [^. Conceptually, the NN classifier [5j is
probably the simplest classification rule. Its use was also spread and encouraged
by early theoretical results linking its generalization error to the Bayes error.
However, from a practical point of view, this algorithm is not suited to very
large DS because of its storage requirements: it involves storing all the
instances in memory. Pioneering work in PS first sought to reduce this
storage size. In [2], Hart proposes a Condensed NN Rule (CNN) to find a Consistent
Subset, CS, which correctly classifies all of the remaining points in the sample
set. However, this algorithm does not find a Minimal Consistent Subset, MCS.
The Reduced NN Rule (RNN) proposed by Gates |B] tries to overcome this drawback,
searching in Hart's CS for the minimal subset which correctly classifies all the
learning instances. However, this approach is efficient if and only if Hart's CS
contains the MCS of the learning set, which is not always the case. In |T], the
presented IB2 algorithm is quite similar to Hart's CNN rule, except that it does
not repeat the process after the first pass. A common characteristic of these
approaches is that they are very sensitive to noise, because noisy instances will
probably be misclassified, and therefore not deleted. First solutions to improve
IB2 were proposed in IB3 [T].
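
For readers unfamiliar with the condensed rule, the following sketch shows it in its commonly described form: instances misclassified by the 1-NN rule on the current store are added to the store, and passes are repeated until nothing changes. Stopping after one pass corresponds to the IB2-like behaviour mentioned above. The function names, the `single_pass` flag and the squared Euclidean distance are assumptions made for this illustration.

```python
def _nearest_label(store, x):
    """Label of the stored instance closest to x (squared Euclidean distance)."""
    nearest = min(store, key=lambda item: sum((a - b) ** 2
                                              for a, b in zip(item[0], x)))
    return nearest[1]


def condensed_nn(points, labels, single_pass=False):
    """Return indices of a consistent subset built by the condensed NN rule."""
    store = [(points[0], labels[0])]   # start from the first instance
    kept = {0}
    changed = True
    while changed:
        changed = False
        for i, (x, y) in enumerate(zip(points, labels)):
            if i in kept:
                continue
            if _nearest_label(store, x) != y:   # misclassified: absorb it
                store.append((x, y))
                kept.add(i)
                changed = True
        if single_pass:
            break
    return sorted(kept)
```

Because a misclassified instance is always absorbed, a mislabeled (noisy) instance ends up in the store, which illustrates the noise sensitivity noted in the paragraph above.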
In [^, Skalak proposes two algorithms to find sets of prototypes for NN
classification. The first one is a Monte Carlo sampling algorithm, and the second
applies random mutation hill climbing. In both algorithms, the size of the
prototype subset must unfortunately be fixed in advance. Moreover, they require
fixing another parameter, which contributes to increasing the time complexity: the
number of samples (in the Monte Carlo method) or the number of mutations. All
these user-fixed parameters make the algorithms highly user-dependent, despite
interesting performances.

Brodley, in [3] with her MCS system and in [3] with Consensus Filters, also
deals with the PS problem. While the approach presented in [3] is an
algorithm for eliminating mislabeled instances rather than a true PS algorithm,
we will see in this paper that it can constitute a good pre-processing step for
the PS problem. This approach constructs a set of base-level detectors and uses
them to identify mislabeled instances by consensus vote, i.e. all classifiers must
agree to eliminate an instance.

Finally, in the RT3 algorithm of Wilson and Martinez (also called DROP3,
with its extensions DROP4 and DROP5), an instance $\omega_i$ is removed if its
removal does not hurt the classification of the instances remaining in the sample
set, notably instances that have $\omega_i$ in their neighborhood (called associates).
More precisely, the algorithm is based on two principles: (1) it uses a
noise-filtering pass that removes instances misclassified by their kNN, which helps
to avoid "overfitting" the data; (2) it removes instances in the center of clusters
before border points, by sorting instances according to the distance to their
nearest enemy (i.e. the nearest instance of a different class).

Beyond the fact that they retain a subset of the original instances (we focus
only on such algorithms in this paper), all these approaches share the
characteristic of selecting as a prototype an instance which optimizes the accuracy
on the learning set. While optimizing accuracy is certainly a way to limit
subsequent errors on test data, recent work has shown that it is not
always the most suitable criterion to optimize. In m. a formal proof is given that
explains why the Gini criterion and the entropy should be optimized instead of
the accuracy when a top-down induction algorithm is used to grow a decision
tree. The same kind of conclusion is presented with another statistical criterion
in [13]. In this paper, we propose to take these ideas into account and use a new
criterion. We introduced this criterion in a feature selection algorithm [14],
which has shown very interesting empirical results within a rigorous theoretical
framework. We propose here to extend the principle and the use of this criterion
to the specific field of prototype selection. It is based on a quadratic entropy
computed from information contained in a neighborhood graph. Since the kNN graph is
probably the simplest neighborhood graph, and since the majority of PS approaches
are based on the kNN (often k = 1), we use this neighborhood structure
in our paper. However, the theoretical results presented here do not depend on
the kNN graph, and other geometrical structures can be used.

Our criterion is asymptotically normally distributed, which allows the
construction of a statistical test assessing the gain of information due to
the deletion of an irrelevant instance. Our approach is based on a backward
algorithm, called PSRCG. It starts from all the learning instances and deletes
irrelevant examples step by step until our criterion stabilizes. To highlight
the interest of our approach, we compare the performance of PSRCG with the other
classic accuracy-based PS methods cited above. A rigorous study
and a discussion are presented to compare the relative strengths and weaknesses
of each algorithm.

2 Theoretical Framework 

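Definition 1. Let

$$S_c = \left\{ (\gamma_1, \ldots, \gamma_c) \in \mathbb{R}^c \;:\; \forall j = 1, \ldots, c,\ \gamma_j \geq 0 \ \text{and}\ \sum_{j=1}^{c} \gamma_j = 1 \right\}$$

be a c-dimensional simplex, where c is a positive integer. An entropy measure
is a mapping from $S_c$ to $\mathbb{R}^+$ with the following properties: Symmetry,
Minimality, Maximality, Continuity and Concavity.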




Definition 2. The Quadratic Entropy is a function QE from $[0, 1]^c$ to $[0, 1]$,

$$QE : S_c \rightarrow [0, 1], \qquad (\gamma_1, \ldots, \gamma_c) \mapsto QE((\gamma_1, \ldots, \gamma_c)) = \sum_{j=1}^{c} \gamma_j (1 - \gamma_j)$$
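
As a small numerical illustration of Definition 2, the quadratic entropy of a class distribution can be computed as below. The function name `quadratic_entropy` and the tolerance used to check that the input lies on the simplex are choices made for this sketch.

```python
def quadratic_entropy(gammas, tol=1e-9):
    """Quadratic entropy: sum over classes of gamma_j * (1 - gamma_j)."""
    if any(g < 0 for g in gammas) or abs(sum(gammas) - 1.0) > tol:
        raise ValueError("gammas must be non-negative and sum to 1")
    return sum(g * (1.0 - g) for g in gammas)
```

The value is zero for a pure distribution such as (1, 0) and is largest for the uniform distribution, which is why it can serve as an uncertainty measure.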

Given the previous definitions, we use the quadratic entropy to measure local
and total uncertainties in the kNN graph built on the learning set S, where n is
the current number of instances and c is the number of classes.

Definition 3. We define the neighborhood $N(\omega_i)$ of a given instance $\omega_i$
belonging to S as follows:

$$N(\omega_i) = \{\omega_j \in S \mid \omega_i \text{ is linked by an edge to } \omega_j \text{ in the kNN graph}\} \cup \{\omega_i\}$$

$\{\omega_i\}$ is inserted in $N(\omega_i)$ to facilitate the elimination of mislabeled instances.

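Definition 4. The local uncertainty $U_{loc}(\omega_i)$ for a given instance $\omega_i$
belonging to S is defined as follows:

$$U_{loc}(\omega_i) = \sum_{j=1}^{c} \frac{n_{ij}}{n_{i.}} \left(1 - \frac{n_{ij}}{n_{i.}}\right)$$

where $n_{i.} = \mathrm{card}\{N(\omega_i)\}$

and $n_{ij} = \mathrm{card}\{\omega_l \in N(\omega_i) \mid Y(\omega_l) = y_j\}$, where $Y(\omega_l)$ describes the class of $\omega_l$.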



Definition 5. The total uncertainty $U_{tot}$ in the learning sample is defined as
follows:

$$U_{tot} = \sum_{i=1}^{n} \frac{n_{i.}}{n_{..}} \sum_{j=1}^{c} \frac{n_{ij}}{n_{i.}} \left(1 - \frac{n_{ij}}{n_{i.}}\right)$$

where $n_{..} = \sum_{i=1}^{n} n_{i.} = n + 2\,\mathrm{card}\{E\}$

and E is the set of all the edges in the kNN graph.
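
Definitions 3 to 5 can be illustrated with the sketch below. It assumes a symmetrised kNN graph (an edge links two instances if either is among the k nearest neighbours of the other) and squared Euclidean distance; both are assumptions of this example, as are the function names.

```python
from collections import Counter


def knn_neighborhoods(points, k):
    """N(w_i) for each i: indices linked to i in the kNN graph, plus i itself."""
    n = len(points)
    hoods = [{i} for i in range(n)]
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
        for j in ranked[:k]:
            hoods[i].add(j)   # edge i-j, stored on both endpoints
            hoods[j].add(i)
    return hoods


def local_uncertainty(hood, labels):
    """U_loc: quadratic entropy of the class frequencies inside one neighborhood."""
    size = len(hood)
    counts = Counter(labels[j] for j in hood)
    return sum((c / size) * (1 - c / size) for c in counts.values())


def total_uncertainty(hoods, labels):
    """U_tot: local uncertainties weighted by n_i. / n_.. ."""
    n_total = sum(len(h) for h in hoods)   # equals n + 2 * card(E)
    return sum(len(h) / n_total * local_uncertainty(h, labels) for h in hoods)
```

Classes absent from a neighborhood contribute zero to the sum, so iterating over the observed class counts is enough.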

Work presented in [T^ shows that the distribution of the relative quadratic
entropy gain is a $\chi^2$ with (n − 1)(c − 1) degrees of freedom. Rather than taking
$U_{tot}$ directly as the criterion, we then define the following Relative Certainty Gain.

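Definition 6. Let RCG be the Relative Certainty Gain,

$$RCG = \frac{U_0 - U_{tot}}{U_0}$$

where $U_0$ is the uncertainty computed directly from the a priori distribution,

$$U_0 = \sum_{j=1}^{c} \frac{n_j}{n} \left(1 - \frac{n_j}{n}\right)$$

where $n_j = \mathrm{card}\{\omega_i \mid Y(\omega_i) = y_j\}$, i.e. the number of instances $\omega_i$ whose label is the class $y_j$.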
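
A minimal sketch of Definition 6, assuming the total uncertainty has already been computed elsewhere; the function names are hypothetical.

```python
from collections import Counter


def prior_uncertainty(labels):
    """U_0: quadratic entropy of the a priori class distribution."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def relative_certainty_gain(u0, u_tot):
    """RCG = (U_0 - U_tot) / U_0."""
    if u0 == 0:
        raise ValueError("U_0 is zero: only one class is present")
    return (u0 - u_tot) / u0
```

An RCG of 1 means the neighborhood graph removes all the uncertainty of the prior distribution, while an RCG of 0 means it brings no information.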

Note that $U_0$ and $U_{tot}$ must be computed from the same number of instances.
Consequently, depending on the class of the instance deleted during the process,
a given $n_j$ will not keep the same value. Also, updating $U_0$ is computationally
inexpensive, because only the points in the neighborhood of the removed instance
are affected by a modification. Updating $U_0$ after removing $\omega_i$ does not cost
more than $O(n_{i.})$, a complexity which can be greatly decreased
by the use of sophisticated data structures such as k-d trees PI-

For reasonably large learning sets (n > 30), the distribution of $n_{..}\,RCG$ is
approximately normal with mean (n − 1)(c − 1) and variance 2(n − 1)(c − 1).

These statistical properties make it possible to test whether the RCG, at a given
step of our process, is significantly different from zero after the deletion of an
irrelevant instance. If it is not, our algorithm stops. Moreover, the critical
thresholds of the test at steps t and t − 1 can be compared to verify whether the
RCG has improved and whether the procedure must continue.
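
One way to turn the normal approximation into a decision rule is sketched below. It assumes a one-sided test at level `alpha` applied to the scaled RCG statistic whose approximate mean and variance are stated above; the exact form of the test used by the authors is not given in the text, so the function name, the parameter `alpha` and the one-sided choice are all assumptions.

```python
from math import sqrt
from statistics import NormalDist


def is_significant(statistic, n, c, alpha=0.05):
    """Compare a scaled RCG statistic with its normal approximation.

    Approximate mean (n - 1)(c - 1) and variance 2(n - 1)(c - 1) are taken
    from the text; the one-sided rejection rule is an assumption.
    """
    dof = (n - 1) * (c - 1)
    z = (statistic - dof) / sqrt(2 * dof)
    return z > NormalDist().inv_cdf(1 - alpha)
```

For example, with n = 31 instances and c = 2 classes the approximate mean is 30, so a statistic of 30 is not significant while a statistic of 60 is.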



3 The Algorithm PSRCG 

Our strategy consists in starting from the whole set of instances and, in a
first step, removing irrelevant examples containing noise. The principle is to
search for the worst instance, i.e. the one whose deletion improves the
information in the DS. An instance is eliminated at step t if and only if
two conditions are fulfilled: (1) the Relative Certainty Gain after the deletion is
better than before, i.e. $RCG_t \gg RCG_{t-1}$ ($\gg$ means "statistically significantly
higher") and (2) $RCG_t \gg 0$.

Once this first procedure is finished (it deletes border instances
and mislabeled examples), our algorithm executes a post-process that removes
the center of the remaining clusters. This strategy deletes useless
instances without misclassifying a new example at these points. After the first
step of our algorithm, and according to a given risk α, all the remaining instances
could have an uncertainty $U_{loc}(\omega_i) = 0$. In order to remove those inside clusters,
we have to tighten the geometrical constraint by increasing the neighborhood size,
and remove instances that still have an uncertainty $U_{loc}(\omega_i) = 0$. This strategy
is suitable not only for conserving border points, but also for removing the
center of clusters. Indeed, if we use a $(k + \epsilon)$-Nearest Neighborhood during this
post-process, border points will likely have an enemy in their neighborhood,
so they will have an uncertainty $U_{loc}(\omega_i) > 0$ and will not be removed.
On the contrary, instances inside clusters will certainly still have an uncertainty
$U_{loc}(\omega_i) = 0$, and so they will be deleted. In our experiments, $\epsilon = 1$
appeared to be the best value for emptying the clusters of their useless instances;
this avoids having to tune the $\epsilon$ parameter and maximizes the storage
reduction. The pseudocode of our algorithm PSRCG is presented in Figure 1.



PSRCG ALGORITHM

t ← 0; Np = |S|
Build the kNN graph on the learning set
Compute RCG_1 = (U_0 − U_tot) / U_0
Repeat
    t ← t + 1
    Select the instance ω = arg max_j U_loc(ω_j)
    If 2 instances have the same U_loc
        select the example having the smallest
        number of neighbors
    Local modifications of the kNN graph
    Compute RCG_{t+1} after removing ω
    Np ← Np − 1
Until (RCG_{t+1} < RCG_t) or not (RCG_{t+1} >> 0)
Remove instances having a null uncertainty with their (k+1)-NN
Return the prototype set with Np instances

Fig. 1. Pseudocode of PSRCG
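
The pseudocode can be turned into a small runnable sketch under several simplifying assumptions that are not in the paper: the kNN graph is symmetrised and rebuilt from scratch after each deletion (instead of being modified locally), plain numerical comparisons replace the statistical tests written ">>", and a candidate deletion is checked before being committed. All function names are hypothetical.

```python
from collections import Counter


def _sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _neighborhoods(points, active, k):
    """Symmetrised kNN neighborhoods over the active indices (self included)."""
    order = sorted(active)
    hoods = {i: {i} for i in order}
    for i in order:
        ranked = sorted((j for j in order if j != i),
                        key=lambda j: _sq_dist(points[i], points[j]))
        for j in ranked[:k]:
            hoods[i].add(j)
            hoods[j].add(i)
    return hoods


def _quadratic_entropy(group_labels):
    size = len(group_labels)
    return sum((c / size) * (1 - c / size)
               for c in Counter(group_labels).values())


def _evaluate(points, labels, active, k):
    """Return (RCG, local uncertainties, neighborhoods) for the active set."""
    hoods = _neighborhoods(points, active, k)
    u_loc = {i: _quadratic_entropy([labels[j] for j in hoods[i]]) for i in hoods}
    n_total = sum(len(h) for h in hoods.values())
    u_tot = sum(len(hoods[i]) / n_total * u_loc[i] for i in hoods)
    u0 = _quadratic_entropy([labels[i] for i in hoods])
    rcg = (u0 - u_tot) / u0 if u0 > 0 else 0.0
    return rcg, u_loc, hoods


def psrcg_sketch(points, labels, k=1):
    """Simplified PSRCG: returns the sorted indices of the kept prototypes."""
    active = set(range(len(points)))
    rcg, u_loc, hoods = _evaluate(points, labels, active, k)
    while len(active) > k + 2:
        # Worst instance: highest U_loc, ties broken by fewest neighbors.
        worst = max(sorted(active), key=lambda i: (u_loc[i], -len(hoods[i])))
        if u_loc[worst] == 0:
            break
        trial = active - {worst}
        new_rcg, new_u_loc, new_hoods = _evaluate(points, labels, trial, k)
        if new_rcg <= rcg or new_rcg <= 0:   # stand-in for the tests
            break
        active, rcg, u_loc, hoods = trial, new_rcg, new_u_loc, new_hoods
    # Post-process: drop instances with null uncertainty in their (k+1)-NN.
    _, u_wide, _ = _evaluate(points, labels, active, k + 1)
    return sorted(i for i in active if u_wide[i] > 0)
```

Rebuilding the graph at every step keeps the sketch short but costs far more than the local updates described in the text; it is meant for small illustrative data only.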



Figure 2 presents an application of PSRCG on a simple 2D example
proposed in [2j. We can note that the PSRCG algorithm not only removes
instances from regions where the local densities of the classes are more or less
the same, but also deletes a posteriori useless instances. We will see later that
PSRCG is also suited to dealing with mislabeled instances.

From these remarks on this simple problem, we can conclude that our
algorithm in a sense keeps the convex hull of instances belonging to
an inter-quantile interval, the width of this interval depending on the value of
k and the thickness of the convex hull depending on the value of $(k + \epsilon)$.
Fig. 2. Prototype Selection on a simulated noise-free example. The set on the left
is the original sample; the subset on the right represents the prototypes selected
by the algorithm PSRCG (k = 3).



4 Experimental Results and Comparisons 

For this section, we carried out substantial programming work in order to
compare our approach with the others¹. Indeed, even if some results for these
algorithms on well-known benchmarks are available in the literature, the
experimental set-up is rarely the same. We first compare the performance of our
approach with Hart's (CNN in Table 1), Gates's (RNN), Skalak's (Monte
Carlo), and Brodley & Friedl's (CF) algorithms. The basic 1-NN classifier is also
run for comparison. The comparison with Wilson & Martinez's approach will
be presented later, for two reasons. The first comes from the value of k.
While CNN and RNN are based on the 1-NN classifier, Wilson and
Martinez argue that too small a value of k (k = 1) results in more dramatic
storage reduction, but may reduce generalization accuracy. Our first experiments
have confirmed this remark, and shown that a higher value is more suitable.
The second reason is that RT3 is certainly, in its concepts, the algorithm
nearest to PSRCG, and so deserves a special investigation.



4.1 First Comparisons 

We have applied the different algorithms to several benchmarks, most of which
come from the UCI database repository². Each original set is divided
into a learning sample LS (2/3 of the instances) and a validation set VS (the
remaining third). Each algorithm is run on LS, and then the selected subset is
tested on VS. The Consensus Filter (CF) of Brodley & Friedl is run using five
kNN base-classifiers (k = 1 to 5) and LOOCV during the learning process
(see [4] for more details). The Monte Carlo method uses 100 samples and Np
prototypes.
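
The evaluation protocol (select prototypes on LS, then classify VS with the 1-NN rule on the selected subset) can be sketched as follows. The random shuffling with a fixed seed, the squared Euclidean distance and the function names are assumptions of this example; the text does not say how the split was drawn.

```python
import random


def split_learning_validation(n_instances, seed=0):
    """Return (LS indices, VS indices): two thirds for learning, one third held out."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = (2 * n_instances) // 3
    return indices[:cut], indices[cut:]


def one_nn_accuracy(prototypes, prototype_labels, points, labels):
    """Percentage of points whose nearest prototype carries the right label."""
    correct = 0
    for x, y in zip(points, labels):
        nearest = min(
            range(len(prototypes)),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(prototypes[p], x)),
        )
        correct += prototype_labels[nearest] == y
    return 100.0 * correct / len(points)
```

Any prototype selection routine can be plugged in between the two functions: it receives the LS instances and returns the subset passed as `prototypes`.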

¹ The software is available by sending an e-mail to msebban@univ-ag.fr
² http://www.ics.uci.edu/~mlearn/MLRepository.html
Dataset       1NN           CNN              RNN              PSRCG             Monte Carlo   CF
              Acc (size)    |CS|   Acc_CNN   MCS    Acc_RNN   Np     Acc_PSRCG  n_s = 100     |CF|   Acc_CF
LED24         66.5 (300)    151    64.0      141    62.5      153    65.5       66.0          244    66.5
LED           83.0 (300)    114    69.5      114    69.5      86     76.5       76.5          266    83.0
Wh. House     92.5 (235)    35     86.0      25     86.5      49     90.4       90.0          226    89.0
Hepatitis     70.9 (100)    32     74.6      29     72.7      36     61.8       74.5          89     61.8
Horse Colic   70.8 (200)    110    67.9      92     67.3      107    70.8       69.6          165    72.6
Echocardio    59.0 (70)     33     63.9      33     63.9      37     63.9       63.0          57     65.6
Vehicle       68.2 (400)    178    61.9      178    61.9      183    65.9       68.0          343    68.8
H2            53.8 (300)    177    59.9      177    59.9      177    54.7       58.0          230    58.0
Xd6           76.0 (400)    157    72.5      157    72.5      200    79.0       74.0          357    77.0
Breast W      97.9 (400)    57     96.3      50     96.3      40     98.0       97.7          390    98.7
Pima          66.0 (468)    245    59.0      245    59.0      240    71.3       69.0          372    75.0
German        67.4 (500)    241    60.8      241    60.8      232    65.8       65.6          417    67.6
Ionosphere    91.4 (185)    52     81.9      40     80.2      81     91.4       84.3          162    89.6
Tic-tac-toe   78.2 (600)    225    76.0      218    76.3      254    82.1       74.6          512    80.1
Average       74.4          40%    71.0      38%    70.7      42.2%  74.1       73.6          85.9%  75.2

Table 1. Comparison between five PS algorithms on 14 benchmarks




From the results in Table 1, and according to the criteria measuring the
relative strengths and weaknesses of each algorithm, several observations can
be made. RNN has the highest storage reduction on average (it keeps 38% of the
instances), but has the lowest generalization accuracy. CNN presents a small
gain in generalization (71.0 vs 70.7), but requires more instances (+2.0 points
on average). The Monte-Carlo sampling presents good results, but it requires
the number of prototypes Np to be known a priori and the number of samples to
be fixed in advance, and it is computationally more expensive. The Consensus
Filter improves the classification accuracy. Despite these interesting results,
it eliminates few instances from the learning set (about 14% on average), and
so cannot be considered a real PS algorithm (which is not its primary goal
anyway). Moreover, it requires fixing the number of base-detectors and the type
of each learning algorithm. Nevertheless, as CF improves the 1NN accuracy in
the great majority of cases (11/14), it could serve as a good pre-processing
stage for PS, as we will see in the next section. Finally, PSRCG has a good
generalization accuracy, very close to that of the 1NN classifier (74.1% vs
74.4% on average), together with an interesting compression rate of around 58%.
It is the only method that achieves the three objectives: (i) reduce the
instance storage, (ii) not compromise the generalization accuracy, and
(iii) control the learning speed.
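As a quick check of the compression rate quoted above, the average share of LS kept by PSRCG can be recomputed from the sizes and Np values reported in Table 1. The snippet below is a small illustrative computation; the variable and function names are ours, not the authors'.

```python
# (|LS|, Np) per dataset, copied from Table 1 (1NN size and PSRCG Np columns).
table1_psrcg = {
    "LED24": (300, 153), "LED": (300, 86), "Wh. House": (235, 49),
    "Hepatitis": (100, 36), "Horse Colic": (200, 107), "Echocardio": (70, 37),
    "Vehicle": (400, 183), "H2": (300, 177), "Xd6": (400, 200),
    "Breast W": (400, 40), "Pima": (468, 240), "German": (500, 232),
    "Ionosphere": (185, 81), "Tic-tac-toe": (600, 254),
}

def mean_retained_percent(sizes):
    """Average, over datasets, of the percentage of LS kept as prototypes."""
    ratios = [100.0 * kept / total for total, kept in sizes.values()]
    return sum(ratios) / len(ratios)

retained = mean_retained_percent(table1_psrcg)
compression = 100.0 - retained
print(f"retained {retained:.1f}%, compression {compression:.1f}%")
```

The per-dataset average agrees with the 42.2% shown in the Average row, i.e. a compression rate of roughly 58%.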


4.2 Comparison with RT3

This section contains separate experiments conducted to study the efficiency
of PSRCG vs RT3, reputedly one of the most efficient PS algorithms in the
field [2]. We kept the same experimental set-up. The reduction algorithms
RT3 and PSRCG were implemented using k = 5. They were tested on more DS
(19) than before and compared to a kNN classifier. In order to see the positive
Dataset      kNN             PSRCG           RT3             CF+PSRCG
             size(%)  Acc    size(%)  Acc    size(%)  Acc    size(%)  Acc
LED24        100      66.5   46.3     69.0   7.0      67.0   43.3     67.5
LED          100      83.0   19.3     83.0   13.3     77.5   20.0     86.5
Wh. House    100      89.5   31.5     88.0   7.2      89.0   21.2     87.5
Hepatitis    100      70.9   42.0     65.5   6.0      65.5   8.0      65.5
Horse Colic  100      70.8   45.5     70.8   13.0     64.9   41.5     67.3
Echocardio   100      62.3   31.4     63.9   21.4     65.6   20.0     65.6
Vehicle      100      68.2   40.8     64.8   14.5     65.3   33.0     69.3
H2           100      53.8   96.3     54.3   20.7     52.4   36.0     51.9
Xd6          100      76.0   50.8     75.0   12.8     75.0   46.0     74.0
Breast W     100      98.0   17.5     99.0   3.0      98.7   9.3      98.3
Pima         100      66.0   42.1     73.7   5.6      69.0   42.0     73.7
German       100      66.8   47.2     69.4   6.2      67.2   33.6     66.8
Ionosphere   100      91.4   53.5     89.7   10.8     90.5   40.5     88.8
Tic-tac-toe  100      78.2   58.8     79.9   12.2     67.0   42.3     78.2
Australian   100      76.6   57.3     78.3   6.8      67.9   34.5     76.2
Balance      100      75.1   47.0     76.0   11.0     78.7   20.0     73.3
Bigpole      100      68.5   43.0     68.0   17.3     68.0   40.0     70.5
Waves        100      78.6   54.3     75.6   24.0     76.1   36.6     77.1
OptDigits    100      96.2   32.7     94.6   18.5     91.9   30.2     94.8
Average      100      75.60  45.1     75.71  12.2     73.54  31.4     75.4

Table 2. Comparison between PSRCG, RT3 and CF+PSRCG on 19 datasets



effect of Brodley & Friedl's approach, which is more specialized in removing
mislabeled instances than PSRCG, we decided to run our algorithm after a
pre-processing step that deletes such instances with the CF algorithm. Results
are presented in Table 2. We can make the following observations:

1. RT3 is suited to dramatically decreasing the size of the learning set (12.2%
of the instances are kept on average). In return, the generalization accuracy
is slightly damaged (about -2.0).

2. PSRCG requires more storage (45.1% on average) but, interestingly, it
slightly improves here on the accuracy of the standard kNN classifier (75.71%
vs 75.6%). This confirms that PSRCG seems to be a good solution for selecting
relevant prototypes, and thus reducing the memory requirements, while not
compromising the classification accuracy.

3. Globally, PSRCG is more stable than RT3. Indeed, computing the mean
deviations |Acc_kNN - Acc_PSRCG| and |Acc_kNN - Acc_RT3|, we observe that
PSRCG provides an accuracy closer to Acc_kNN (±2.0 on average) than RT3
(±3.3). This means that PSRCG is less coarse than RT3, and so it does not
compromise the classification accuracy much.

4. Associated with CF, our PSRCG algorithm reduces the storage requirements
without losing much accuracy. Unfortunately, the drawback of this association
is the high computational cost during the learning stage. However, if the use
of this procedure results in greater storage reduction, and hence smaller
computational costs during generalization, the extra computation during
learning can be very profitable.

PS Algorithm   Noise-Free   Size %   Noisy   Size %
kNN            75.60        100.0    70.0    100.0
PSRCG          75.71        45.1     74.4    48.8
RT3            73.54        12.2     70.7    11.5
CF + PSRCG     75.40        31.4     73.8    34.9

Table 3. Average accuracy and storage requirements when noise is inserted
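The mean deviations quoted in observation 3 can be recomputed directly from the accuracy columns of Table 2. The following is a small illustrative check; the tuple layout and function name are our own choices.

```python
# (Acc_kNN, Acc_PSRCG, Acc_RT3) per dataset, copied from Table 2 (19 rows).
table2 = [
    (66.5, 69.0, 67.0), (83.0, 83.0, 77.5), (89.5, 88.0, 89.0),
    (70.9, 65.5, 65.5), (70.8, 70.8, 64.9), (62.3, 63.9, 65.6),
    (68.2, 64.8, 65.3), (53.8, 54.3, 52.4), (76.0, 75.0, 75.0),
    (98.0, 99.0, 98.7), (66.0, 73.7, 69.0), (66.8, 69.4, 67.2),
    (91.4, 89.7, 90.5), (78.2, 79.9, 67.0), (76.6, 78.3, 67.9),
    (75.1, 76.0, 78.7), (68.5, 68.0, 68.0), (78.6, 75.6, 76.1),
    (96.2, 94.6, 91.9),
]

def mean_abs_deviation(rows, column):
    """Mean of |Acc_kNN - Acc_X| over all datasets, X given by `column`."""
    return sum(abs(row[0] - row[column]) for row in rows) / len(rows)

dev_psrcg = mean_abs_deviation(table2, 1)
dev_rt3 = mean_abs_deviation(table2, 2)
print(f"PSRCG: {dev_psrcg:.1f}  RT3: {dev_rt3:.1f}")
```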



4.3 Noise Tolerance 

After analyzing the storage reduction, the generalization accuracy and the
learning speed, we test in this section the sensitivity of our algorithm to
noise. A classic approach to this problem consists in artificially adding some
noise to the DS. This was done here by randomly changing the output class of
10% of the learning instances. Proceeding this way makes it possible to assess
the ability of the algorithm to correctly predict the output even if the
learning DS is noisy, i.e. some instances are mislabeled. We do not test the
CNN and RNN algorithms on noisy samples in this section, because they are known
to be very sensitive to noise [...]. Table 3 shows the average accuracy and
storage requirements for PSRCG, RT3, kNN and CF + PSRCG over the 19 datasets
already tested.
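The noise-insertion step can be sketched as below. This is a minimal illustration, not the authors' code: the text only says that the class of 10% of the learning instances is changed at random, so drawing the new class uniformly among the other classes is an assumption, as are the function name and the toy labels.

```python
import random

def inject_label_noise(labels, classes, rate=0.10, seed=0):
    """Return a copy of `labels` where `rate` of the entries get a different class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flips = round(rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flips):
        # Assumption: the replacement class is drawn uniformly among the others.
        alternatives = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

clean = [0, 1, 2] * 20  # 60 toy labels (hypothetical)
noisy = inject_label_noise(clean, classes=[0, 1, 2])
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed, "labels changed out of", len(clean))
```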

We can note that the accuracy of the kNN algorithm dropped by 5.6 points on
average. In such a noisy context, RT3 achieved a higher accuracy than the kNN
algorithm, while using fewer instances. Moreover, RT3 selected fewer instances
on noisy data than on the noise-free DS (11.5% vs 12.2% on average). However,
its accuracy fell markedly when noise was inserted (-2.8 on average). On the
other hand, PSRCG has the highest accuracy (74.4%) of the four algorithms,
while its storage requirements remain relatively controlled. It presents a good
noise tolerance because its accuracy does not fall much (-1.3 for 10% of
noise), versus -2.8 for RT3, -1.6 for CF+PSRCG and -5.6 for the basic kNN.
Finally, CF+PSRCG, despite higher computational costs during the learning
stage, allows a higher storage reduction than PSRCG, at the price of a slightly
lower accuracy (73.8 vs 74.4).

5 Conclusion 

Until recently, accuracy seemed to be the undisputed criterion for dealing with
the feature selection problem. Recent works try to analyze, empirically and
theoretically, the effect of optimizing other criteria, such as criteria coming
from information theory [...], or using new properties of boosting [15]. As
far as we know, this new strategy is not very widespread in prototype selection
(except [7]). This new way of dealing with the problem is very interesting because
PS Algo.    Stor. Reduc.   Ac. Gen.   Noise Tol.   Learn. Speed
PSRCG       **             [...]      ***          ***
RT3         [...]          ***        **           ***
CF          *              [...]      ***          **
CF+PSRCG    ***            ***        ***          *
CNN         **             *          *            ***
RNN         **             *          *            **
M. Carlo    to fix         **         **           *

Table 4. Evaluation (number of stars) of the PS algorithms according to 4
discriminant criteria



it allows the selection to be isolated as a pre-process independent of the
learning methods and of any given ad hoc algorithm. The criterion presented in
this paper tries to take this phenomenon into account, while attempting to
maximize four main performance criteria. Table 4 attempts to sum up the
strengths and weaknesses of each PS algorithm tested in this paper. PSRCG
presents very interesting properties, controlling all the criteria.
Experimental results and comparisons show the relevance of our criterion in
comparison with the classic accuracy. The use of the NN topology as an
information graph ensures a low complexity of our algorithm, even if local
modifications of neighborhoods are necessary after the deletion of an instance.
Our future work will deal with the possibility of using other geometrical
structures to build the neighborhood graph. Indeed, even if the kNN graph is
used here only as an information graph, without using the accuracy of the kNN,
it nevertheless represents a given classifier. We are now testing the effect of
other independent graphs, such as the Minimum Spanning Tree, the Delaunay
Triangulation, or the Gabriel Graph.
Abstract. Today’s case-based reasoning applications face several challenges.
In a typical application, the case bases grow at a very fast rate
and their contents become increasingly diverse, making it necessary to 
partition a large case base into several smaller ones. Their users are over- 
loaded with vast amounts of information during the retrieval process. 
These problems call for the development of effective case-base mainte- 
nance methods. As a result, many researchers have been driven to design 
sophisticated case-base structures or maintenance methods. In contrast, 
we hold a different point of view: we maintain that the structure of a 
case base should be kept as simple as possible, and that the maintenance 
method should be as transparent as possible. 

In this paper we propose a case-base maintenance method that avoids
building sophisticated structures around a case base or performing complex
operations on a case base. Our method partitions cases into clusters
where the cases in the same cluster are more similar to each other than to cases in
other clusters. In addition to the content of textual cases, the cluster- 
ing method we propose can also be based on values of attributes that 
may be attached to the cases. Clusters can be converted to new case 
bases, which are smaller in size and when stored distributedly, can entail 
simpler maintenance operations. The contents of the new case bases are 
more focused and easier to retrieve and update. To support retrieval in 
this distributed case-base network, we present a method that is based on 
a decision forest built with the attributes that are obtained through an 
innovative modification of the ID3 algorithm.



1 Introduction 

Case-based reasoning [...] is a technique for reusing past problem-solving
experiences to solve future problems. The basic idea is based on analogy, whereby
similar problems are found and their solutions are retrieved and adapted for 
solving the new problem. The effectiveness of a CBR system critically depends 
on the speed and quality of the case base retrieval process. If the retrieved cases
are not accurate or the retrieval performance is too low, then a CBR system can- 
not function as expected. If too many seemingly similar solutions are retrieved, 
as in the case of some web browsers where thousands of items are returned, a 
CBR system cannot provide its users with much assistance either. The purpose 
of this paper is to present a novel case-base maintenance and retrieval system 
aimed at improving the accuracy and performance of a CBR system when the 
number of cases gets large. 

Our approach is based on two related ideas. The first one is clustering, 
whereby a large case base is decomposed into groups of closely related cases. 
Based on the partitioned structure, we then create a collection of distributed 
case bases across a network on different sites, where each element in the dis- 
tributed case base structure is one cluster created as a result of the clustering 
process. From each cluster we build a representative case, which takes a subset 
of the attributes (or “features” in some literature) in that cluster as its repre- 
sentation. This step builds a two-level hierarchical structure. 

Our second idea is to allow a user to retrieve the distributed case bases by 
incrementally selecting the attributes that are information-rich and can cover 
the entire distributed case base structure. These attributes are presented to the 
user in an interactive manner, whereby a user chooses some attributes to provide 
values with. In each iteration, a subset of the case-base clusters are removed from 
consideration. The process repeats until a target case base is identified. At this 
point, a case based reasoning system is used to rank and identify the cases in 
the case base in order to find the final answer. 

Our work is motivated by our experience in case base maintenance research. 
Case-base maintenance refers to the task of adding, deleting and updating cases, 
indexes and other knowledge in a case base in order to guarantee the ongoing 
performance of a CBR system. Case-base maintenance is particularly important 
when a case based reasoning system becomes a critical problem solving system 
for an organization. Recently, there has been a significant increase in
case-base maintenance research. One branch of research has focused on the
ongoing maintenance of case-base indexes through training and case base usage
[...]. In [13], a prototype-based incremental network is proposed to accelerate
the indexing process in CBR. This network is built on an abstract hierarchy of
a given case base, which is a clustering structure for the case base.
Researchers have also focused on increasing the overall competence of the case
base through case deletion [...]. Leake and Wilson [...] provide an excellent
survey of this field.

Most existing works on case-base maintenance are based on one of two gen- 
eral approaches: either build an elaborate case-base structure and maintain it 
continuously, or use a sophisticated algorithm to delete or update a case base 
when its size reaches a certain threshold. In contrast, we take a different view 
in this paper. We maintain that to ensure effective case-base maintenance, both 
the structure of the cases and the maintenance method must be simple. This 
philosophy is based on the following observations on the changing landscape 
of computing in general. First, in recent years we have seen that disk space is
getting increasingly cheaper and networked databases are increasingly efficient.
This makes it highly feasible to store gigabytes of case data and retrieve
the data efficiently. Second, the approach that maintains a highly structured 
hierarchical CBR system in order to maintain a case base is likely to encounter 
the recursive problem of maintaining the structure itself on a continuing basis. 
Third, case-base maintenance algorithms that delete cases are likely to be short- 
sighted and can easily erase invaluable corporate memory which may prove to be 
useful in the long run. These algorithms are often sophisticated, thus incurring 
high computational overhead. 

Based on these observations, we have adopted a new strategy whereby we
keep all cases and use a simple method to maintain the case bases. Our idea is to
create multiple, small case bases that are located on different sites. Each small 
case base contains cases that are closely related to each other. Between different 
case bases the cases are farther apart from each other. Our approach is to allow 
the cases to be added and deleted at each small case base without affecting the 
whole. This distributed concept ensures that each case base is small and it is 
easier to maintain each one individually. Figure 1 illustrates the architecture.




Fig. 1. System Architecture. In the figure “CBC” refers to case base clusters. 



2 Case Base Clustering Using CBSCAN 

In this section we present an algorithm which uses cluster analysis to build a 
case base maintainer. The basic idea is to apply clustering analysis to a large 
case base and efficiently build natural clusters of cases based on the density of
attribute values. 



2.1 Clustering Techniques in CBR 

Clustering techniques are applicable to CBR because each element in a case base
is represented as an individual, and there is a strong notion of a distance
between different cases. In the past, some attempts have been made in inductive
learning and neural networks to apply clustering techniques to case-based
reasoning [...]. The basic idea of the inductive methods is to build a
classification tree based on an analysis of the information gain associated
with each attribute or question used to represent a case [...]. A drawback of
inductive approaches is that they are expensive to run on large case bases
[...]. When case bases
change, it is necessary to rebuild the whole structure. It is also difficult to tackle
the problem of missing values; to compute a decision tree, most inductive meth- 
ods assume that all cases are associated with all attributes. From our experience 
in applying case based reasoning, this assumption is generally false. 

We choose a density-based clustering method as our basis. The density-based 
method overcomes the defects of inductive clustering methods and many other 
clustering methods because it is relatively efficient to execute and does not re- 
quire the user to pre-specify the number of clusters. Density-based methods are 
based on the idea that it is likely that cases with the same attributes should 
be grouped into one cluster. Intuitively, a cluster is a region that has a higher 
density of points than its surrounding region. For a case, the more cases that 
share the same attributes with it, the larger its density is. The density-based
method originates from a method called GDBSCAN [...], proposed for data
mining. The main feature of the algorithm is that it relies on the density of data,
so that it can discover clusters of arbitrary shape, which is very important for
CBR in order to group all similar cases together.

More specifically, GDBSCAN accepts a radius value Eps based on a distance 
measure, and a value MinPts for the number of minimal points. The latter is 
used to determine when a point is considered dense. Then it iteratively computes 
the density of points in an N-dimensional space, and groups the points into 
clusters based on the parameters Eps and MinPts. The algorithm is found to 
outperform many well-known algorithms such as K-prototype. A main problem of
GDBSCAN, however, is that a user must input a radius value Eps in order to
construct the partitions. If the user is not a domain expert, it is difficult
to choose the best value for Eps. In response, we have developed a method for
finding a near-optimal Eps value through a local search process. We call this new 
algorithm CBSCAN. CBSCAN is based on the observation that the minimum 
radius value Eps is critical in determining the quality of a partition. Thus, a 
local-search algorithm is used to find a locally optimal Eps value that optimizes 
a certain quality measurement. 
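To make the roles of Eps and MinPts concrete, here is a minimal density-based clustering sketch in the style of plain DBSCAN. It is an illustration only and not the GDBSCAN implementation used by the authors: the function name, the brute-force neighbor search and the toy data are assumptions.

```python
NOISE = -1

def density_cluster(points, eps, min_pts, dist):
    """Label each point with a cluster id, or NOISE, using Eps and MinPts."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force Eps-neighborhood; the point itself is included.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:      # not dense: provisionally noise
            labels[i] = NOISE
            continue
        labels[i] = cluster           # dense point starts a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:    # border point reached from a dense point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # j is dense too: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]  # toy 1-D data (hypothetical)
labels = density_cluster(points, eps=0.15, min_pts=2, dist=lambda a, b: abs(a - b))
print(labels)
```

Note that the number of clusters is never given as input; it follows from Eps and MinPts, which is why the choice of Eps matters so much.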

Before introducing CBSCAN, we must first discuss how to measure the qual- 
ity of a given partition of a case base. We use the new Condorcet criteria (NCC) 
which is based on the idea that a good partition has small intra-cluster
distances and large inter-cluster distances. More precisely, let Y be a
partition; it is a set of clusters. Y can be represented as a matrix such that
y_ij = 1 if and only if cases i and j are in the same cluster. The quality of a
partition, NCC(Y), can be represented as:



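$$ \mathrm{NCC}(Y) = \sum_{i=1}^{n} \sum_{j \neq i} (m - 2 d_{ij})\, y_{ij} = \sum_{i=1}^{n} \sum_{j \neq i} C_{ij}\, y_{ij} \qquad (1) $$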

In this equation, n is the total number of cases, d_ij is the distance between
two cases i and j, m is the number of attributes or features in a domain, and
C_ij = m - 2 d_ij = (m - d_ij) - d_ij is the difference between the number of
agreements between any two elements i, j that are in the same cluster and the
number of disagreements between them. The NCC measures the quality of a
partition without favoring a large number of clusters. Therefore it can be used
as a criterion to optimize the clustering result. For numerical attributes, the
distance measures can be made categorical by discretization.
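A small sketch may help to see how the criterion behaves. It assumes that d_ij is the number of attributes on which two categorical cases disagree (which matches the agreements/disagreements reading above) and sums over ordered pairs i, j with j different from i; the function names and toy cases are hypothetical.

```python
def disagreements(a, b):
    """d_ij: number of attributes on which two categorical cases differ."""
    return sum(x != y for x, y in zip(a, b))

def ncc(cases, labels):
    """Sum of C_ij = m - 2*d_ij over ordered pairs i != j in the same cluster."""
    m = len(cases[0])
    total = 0
    for i in range(len(cases)):
        for j in range(len(cases)):
            if i != j and labels[i] == labels[j]:
                total += m - 2 * disagreements(cases[i], cases[j])
    return total

# Four toy cases with m = 3 attributes: two identical pairs.
cases = [("x", "y", "z"), ("x", "y", "z"), ("p", "q", "r"), ("p", "q", "r")]
natural = ncc(cases, [0, 0, 1, 1])      # the two natural clusters
merged = ncc(cases, [0, 0, 0, 0])       # everything in one cluster
singletons = ncc(cases, [0, 1, 2, 3])   # every case alone
print(natural, merged, singletons)
```

In this toy setting the natural partition scores higher than both the single cluster and the all-singletons partition, which is the behavior the text describes.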



2.2 The CBSCAN Algorithm for Clustering Case Bases 

We now introduce the case-base clustering algorithm CBSCAN. From our tests
(not shown in this paper due to space limitations) we found that the parameter
MinPts is not critical in the definition of density, so we arbitrarily set
MinPts to a constant such as 2. To find a value for Eps that gives a good
partition, we adjust Eps according to the NCC of the partition. In CBSCAN, Eps
is always moved in the direction that leads to a larger NCC value. When NCC
first increases and then decreases as Eps changes, we know that we have passed
a locally maximal NCC point. We then let Eps oscillate around this point until
an approximate Eps value that produces the locally optimal NCC value is found.

Algorithm 2.1 NCC-based clustering algorithm CBSCAN
Input: A case base D
Output: A set of clusters that partition the case base D
Method

1. Initialize parameters:
     MinPts = 2;
     Eps = maximal distance between any two cases;
     EpsInterval = initial step size for Eps;
     BestQuality = a very small number;
     BestEps = Eps;
     Direction = +1;
     K = a constant larger than 1;
     IterationNumber = 0;

2. While (IterationNumber < MaxIterNumber)
     GDBSCAN(Eps, MinPts);
     CurrentQuality = ComputeQuality(Eps);
     If (CurrentQuality > BestQuality)
       BestQuality = CurrentQuality;
       BestEps = Eps;
     Else
       Direction = -Direction;
       EpsInterval = EpsInterval / K;
     End If
     If (Direction > 0)
       Eps = Eps - EpsInterval;
     Else
       Eps = Eps + EpsInterval;
     End If
     IterationNumber++;
   End While

The function ComputeQuality() returns the NCC value for the current partition.
We will show the test results of this algorithm in Section 4. For each case
in the case base, we need to check whether it is dense. This process is repeated
m times, so the total time for this algorithm is, in fact, O(m * n * log n).
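The search loop of Algorithm 2.1 can be transcribed almost literally as below. This is a sketch under stated assumptions: `quality(eps)` stands in for running GDBSCAN and computing the NCC of the resulting partition, the stopping rule on the step size is an assumption inspired by the 0.01 precision mentioned in the experiments, and the toy quality function and parameter values are hypothetical.

```python
def search_eps(quality, eps_start, eps_interval, k=5.0, max_iter=50, precision=0.01):
    """Local search over Eps; returns the best Eps seen and its quality."""
    eps = eps_start
    best_quality = float("-inf")
    best_eps = eps
    direction = +1  # +1 means "decrease Eps", as in Algorithm 2.1
    for _ in range(max_iter):
        current = quality(eps)
        if current > best_quality:
            best_quality, best_eps = current, eps
        else:
            # No improvement: reverse the direction and shrink the step.
            direction = -direction
            eps_interval /= k
            if eps_interval < precision:  # assumed stopping rule
                break
        eps = eps - eps_interval if direction > 0 else eps + eps_interval
    return best_eps, best_quality

def toy_quality(eps):
    """Hypothetical single-peaked quality curve with its maximum at Eps = 0.53."""
    return -(eps - 0.53) ** 2

best_eps, best_q = search_eps(toy_quality, eps_start=0.92, eps_interval=0.15)
print(best_eps, best_q)
```

Because each candidate is compared with the best quality seen so far, the search stops near the peak rather than exactly on it; a smaller initial step or a different K trades accuracy against the number of clustering runs.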



2.3 Building Distributed Case Bases 

After a large case base is partitioned, a domain expert can build some smaller 
case bases on the basis of clustering result. We call the large case base the OLCB 
(original large case base), and the smaller case base built with a cluster the CBC 
(case base cluster). Each CBC has a case base name and a list of keywords. The
case base name is the description of the case base. It is a set of the most frequently
used words in the cases of the case base. There is a set of attributes associated
with the case base. They are all the attributes that are associated with the cases 
in the case base. The weight of the cluster with that attribute-value pair is the 
average weight of all the cases in the cluster. 

Consider an example case base for a Cable-TV diagnosis application. Suppose
that the initial case base consists of 8 cases, of which some are for diagnosing
VCR problems and others for TV problems. As shown in Figure 2, after
clustering, Case 1, ..., Case 4 are grouped into Cluster 1, which is about VCR
problems, and Case 5, Case 6 and Case 7 are grouped into Cluster 2, which is
about some TV problems, while Case 8 is noise. The domain expert builds a case
base "VCR" for Cluster 1 under a directory called "VCRdomain" and "TV" for
Cluster 2 under a directory called "TVdomain".
[Figure 2. Cases 1 to 4 form the cluster "VCR Problem" (keywords: VCR, tape, ...;
attribute-values: (q1, a11) (q1, a12) (q2, a21); link: ~jwub/vcrdomain). Cases 5
to 7 form the cluster "TV Problem" (keywords: sound, picture, remote control;
attribute-values: (q1, a13) (q2, a22) (q2, a23); link: ~jwub/tvdomain). Case 8
("no power") is left outside both clusters.]

Fig. 2. Clustering a Cable-TV case base



3 Information-Guided Cluster-Retrieval Algorithm 

In this section we present our second step: retrieving the most similar
case-base cluster (CBC) by analyzing the information gains of the attributes.
Our method is summarized as follows:

- combine information theory with CBC retrieval to find the attribute that
  best distinguishes the CBCs built with the algorithms introduced in
  Section 2;

- deal with CBCs that are irrelevant to a selected attribute. To do this, a
  decision forest instead of a decision tree is built. The roots of the decision
  trees in the forest cover all the CBCs under them;

- allow the users to interactively browse the CBCs to find the most similar
  CBC, instead of the traditional search of a pre-built structure;

- create the decision forest dynamically based on information theory as the
  users narrow down their search.

Given a collection of case bases, we want to select a subset of the attributes 
to present to the user. A user can choose among this set of attributes a subset 
to provide answers or values for. For example, in a retrieval process, a user
might be given attributes a1, a2, and a4. The user might choose to answer a1
and a4. These answers will eliminate a subset of clusters from consideration and
promote another subset as possible candidates. The system ranks those case base
clusters that are highly likely to contain the final result based on these answers
and allows the user to continue browsing into any chosen subsets. The process
continues until a final cluster is identified. At this point, a simple CBR system 
can be used on this case base for case retrieval. 

There are several requirements for this process. First, to ensure coverage, the 
attributes selected by the retrieval algorithm must cover all case base clusters. 
For most case base applications, not all cases are associated with all attributes. 
Therefore, no single attribute may cover all case base clusters. Thus, we must 
select more than one attribute so that the whole subset will cover the case bases. 
This induces a decision forest instead of a decision tree. Second, we still want
the attributes we present to the user to have the maximal information value.

Our attribute-selection algorithm is informally described as follows. First, 
for all attributes that are associated with the case bases, we calculate their 
information-gain ratios based on Quinlan’s algorithm [15]. Then, we iteratively
select a collection of attributes so that all case bases in a current “candidate set” 
are covered. Then, we present those attributes in the form of questions to the 
user, and obtain values as answers to the questions. 

We now illustrate this algorithm through an example. Suppose there are 
ten cases in a case base. Five CBCs were built, each holding two cases. The 
descriptions of cases are represented by four attributes. Each case is associated 
with some attribute-value-weight tuples. The case base is shown in Table 1.



Case Name   Attr 1     Attr 2     Attr 3     Attr 4     CBC No.
Case 1      (a,100)    -          (a,100)    (a,100)    1
Case 2      (a,100)    -          (a,100)    (b,100)    1
Case 3      (a,100)    -          (b,100)    (c,100)    2
Case 4      (a,100)    -          (b,100)    (d,100)    2
Case 5      -          (a,50)     (d,100)    (a,100)    3
Case 6      -          (a,50)     (a,100)    (b,100)    3
Case 7      -          (a,100)    (c,100)    (a,100)    4
Case 8      -          (b,100)    (d,100)    (a,100)    4
Case 9      -          (b,100)    -          (b,100)    5
Case 10     -          (b,100)    -          (c,100)    5

Table 1. Example Case Base For CBC Retrieval



The attribute with the largest information-gain ratio for all the CBC’s is 
located as the root of a decision tree. The CB retrieval system sorts all the 
attributes by their information gain ratio. The information gain ratios of the
attributes are listed in Table 2. Attribute 2 has the largest information gain
ratio. It is set as the root for the first decision tree in the decision forest. Before 
Attribute 2 is returned to the user, it is checked whether Attribute 2 covers all 
the CBCs. We discover that Attribute 2 only covers CBC 3, CBC 4 and CBC 
5. Thus, we continue to calculate the information-gain ratio of Attribute 1, 3 
Attribute     Information Gain Ratio
Attribute 2   8.72
Attribute 4   6.99
Attribute 3   4.15
Attribute 1   0

Table 2. Attributes sorted by information-gain ratio for all CBCs



and 4 for CBC 1 and CBC 2. Attribute 3 has the largest gain and it covers 
the remaining two CBCs. Since Attribute 2 and Attribute 3 cover all the CBCs, 
these two are returned to the user. The decision forest is illustrated in Figure 3.
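The coverage loop of this example can be sketched as follows. The case data are taken from Table 1 (weights omitted, since coverage only depends on which attributes are present). The gain-ratio computation itself is not reproduced: `reported_best` is a hypothetical stand-in that simply returns the winners reported in the text (Attribute 2 for all CBCs, then Attribute 3 for CBC 1 and CBC 2), and the function names are our own.

```python
# Example case base from Table 1: case number -> ({attribute: value}, CBC number).
cases = {
    1: ({1: "a", 3: "a", 4: "a"}, 1),
    2: ({1: "a", 3: "a", 4: "b"}, 1),
    3: ({1: "a", 3: "b", 4: "c"}, 2),
    4: ({1: "a", 3: "b", 4: "d"}, 2),
    5: ({2: "a", 3: "d", 4: "a"}, 3),
    6: ({2: "a", 3: "a", 4: "b"}, 3),
    7: ({2: "a", 3: "c", 4: "a"}, 4),
    8: ({2: "b", 3: "d", 4: "a"}, 4),
    9: ({2: "b", 4: "b"}, 5),
    10: ({2: "b", 4: "c"}, 5),
}

def coverage(cases):
    """Map each attribute to the set of CBCs in which it appears."""
    covered = {}
    for attrs, cbc in cases.values():
        for attribute in attrs:
            covered.setdefault(attribute, set()).add(cbc)
    return covered

def select_roots(cases, best_attribute):
    """Pick decision-tree roots until every CBC is covered by some root."""
    covered_by = coverage(cases)
    remaining = {cbc for _, cbc in cases.values()}
    candidates = set(covered_by)
    roots = []
    while remaining and candidates:
        attribute = best_attribute(remaining, candidates)
        roots.append(attribute)
        candidates.discard(attribute)
        remaining -= covered_by[attribute]
    return roots

def reported_best(remaining, candidates):
    # Stand-in for the gain-ratio ranking: winners as reported in the text.
    return 2 if remaining == {1, 2, 3, 4, 5} else 3

roots = select_roots(cases, reported_best)
print(coverage(cases)[2], roots)
```

In a real system `reported_best` would be replaced by a function that recomputes the information-gain ratios on the CBCs that are still uncovered.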





Fig. 3. Decision Forest Example 



4 Empirical Tests for Clustering 

So far, we have introduced a novel clustering method, CBSCAN, and an
information-gain guided cluster-retrieval method into the framework of CBR.
This system is implemented in a case-based reasoning system. In this section,
we demonstrate the effectiveness of the algorithms through experiments. All the
algorithms are implemented in Microsoft Visual C++. All experiments were run on
a 166 MHz Pentium PC with 96 MB of memory.
In this section, we test a number of hypotheses which are raised throughout
this paper.

- We show through an experiment that the choice of the Eps value, which we
  mentioned in Section 2, has a dramatic effect on the quality of the resultant
  partition generated using the density-based methods. Our experiment shows
  that Algorithm CBSCAN can indeed find an Eps point where a near-optimal
  clustering result is obtained.

- We show that our clustering method CBSCAN can indeed scale up to large
  case bases, and can in fact outperform well-known algorithms such as the
  K-prototype method [8] or ID3 [15].

In this experiment, we used a case base with 45 cases in text format from a
realistic Cable-TV case base. The text file has the structure shown below, where
each case has a case name, a problem description and a solution.

1. Name: no picture; white screen; faulty TV, descrambler, converter 

2. Reason: Faulty Cable-TV set but the descrambler may be the problem. 

3. Solution: Connect cable directly to TV; if there is a picture, the descrambler 
is the problem. 

Figure 4 illustrates how the NCC quality measurement varies with Eps. We
can tell that after eight repetitions in Algorithm 2.1 (CBSCAN) the Eps found is
almost optimal. In particular, Eps is first set to 0.92, the maximum distance
between cases. When Eps = 0.53, the clustering algorithm obtains the optimal
clustering result, where the quality defined by NCC is 13.67. The sequence of
Eps values is 0.92 -> 0.77 -> 0.62 -> 0.47 -> 0.50 -> 0.53 -> 0.53. It takes 7
iterations to get the best result. In the last two steps, the Eps values are
both 0.53, because the increase at this point is less than 0.01, which is the
precision defined by the system.




Fig. 4. NCC as a function of Eps 



The total CPU time spent in finding the clusters and the final clustering
quality values computed using NCC (Equation 1) are shown in Table 3. The
extended clustering algorithm CBSCAN (Algorithm 2.1) is compared with the
well-known clustering algorithm k-Prototype [8], where the parameter k is set
to 5, and with the IDS algorithm [?].

Clustering Method    Time (seconds)    New Condorcet Criterion
CBSCAN               0.7               13.67
K-Prototype          12                -1075
IDS                  66                -120

Table 3. CPU Time and NCC Results on a Cable-TV Case Base

ID    # Cases    # Attributes    CBSCAN     K-Pro      IDS
1     1118       12              49872      -383764    95585
2     2076       20              39645      -17878     36794
3     3804       25              -249271    -98863     -1432227
4     7000       30              13397      -1330      5698

Table 4. Experiment Results on NCC


Next, we used data from the UCI repository of machine learning databases and
domain theories (http://www.ics.uci.edu/~mlearn/MLRepository.html). The
attributes in the case bases are categorical. In case-based reasoning, an
attribute is not associated with every case, especially when a case base is
large, because some attributes may not be relevant to a case. This
missing-value phenomenon was not common in the data from the UCI repository,
where the number of missing values is very small. To simulate real case bases
where there may be many missing values, we merged several case bases
together. If an attribute appears in two case bases, the two attributes are
merged into one. This modification created many missing values in the
resultant case base (the total size is 8,000 cases).

There are 4 groups of data in this test. The clustering results are shown in
Table 4, where the last three columns show the quality measure of the computed
clustering according to NCC. The CPU time results are shown in Table 5,
together with the size of each case base. From these results, we can see
that, with a few exceptions, our CBSCAN algorithm easily outperforms both the
K-prototype and IDS algorithms in both CPU time and result quality.



ID    # Cases    # Attributes    CBSCAN    K-prototype    IDS
1     1118       12              8         16             29
2     2076       20              21        53             40
3     3804       25              29        79             126
4     7000       30              99        147            223

Table 5. Experiment Results on CPU Time
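
The comparison stated above can be checked directly against the reported numbers. The following is a minimal sketch, assuming only the values printed in Tables 4 and 5; the dictionary layout and function names are illustrative and not part of the original system.

```python
# Values copied from Table 4 (NCC, higher is better) and Table 5 (CPU time).
# The data layout and helper names are illustrative assumptions.
METHODS = ("CBSCAN", "K-prototype", "IDS")

ncc = {
    1: {"CBSCAN": 49872, "K-prototype": -383764, "IDS": 95585},
    2: {"CBSCAN": 39645, "K-prototype": -17878, "IDS": 36794},
    3: {"CBSCAN": -249271, "K-prototype": -98863, "IDS": -1432227},
    4: {"CBSCAN": 13397, "K-prototype": -1330, "IDS": 5698},
}

cpu_time = {
    1: {"CBSCAN": 8, "K-prototype": 16, "IDS": 29},
    2: {"CBSCAN": 21, "K-prototype": 53, "IDS": 40},
    3: {"CBSCAN": 29, "K-prototype": 79, "IDS": 126},
    4: {"CBSCAN": 99, "K-prototype": 147, "IDS": 223},
}


def slowdown_vs_cbscan(row):
    """Return how many times longer each baseline ran than CBSCAN."""
    return {m: row[m] / row["CBSCAN"] for m in METHODS if m != "CBSCAN"}


def best_by_ncc(row):
    """Return the method with the highest NCC value for one case base."""
    return max(METHODS, key=lambda m: row[m])


if __name__ == "__main__":
    for case_base_id in sorted(cpu_time):
        print(case_base_id,
              slowdown_vs_cbscan(cpu_time[case_base_id]),
              best_by_ncc(ncc[case_base_id]))
```

On these figures CBSCAN is the fastest method on every case base, while the NCC ranking is mixed, which is what "with a few exceptions" refers to.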
5 Conclusions and Future Work

This work has made two linked contributions. First, a new case-base clustering
algorithm is presented that is both efficient and effective in dealing with
large case bases. Second, the clustered case bases are organized into a
distributed structure so that an interactive process can proceed for a user to
identify the target cases. Since the size of a cluster is relatively small,
any simple CBR retrieval method can be used once a target case base is found
in the second step. These two algorithms support our initial philosophy of
maintaining large case bases by keeping all cases around and maintaining only
the simplest structure.
Abstract. In this paper we focus on the problem of integrating knowledge
bases expressed in a description logic. To this end, we propose three
basic operations: union, intersection and renaming. First, the semantics
of these compositional operations is studied abstracting away from
implementation details. Then, we present an implementation of the proposed
operations, for knowledge bases expressed in the language $\mathcal{ALN}$
extended with recursive definitions of concepts, which transforms
compositions of knowledge bases into knowledge bases. This transformation
is sound and complete with respect to the semantics referred to before.



Keywords: Knowledge Representation (Description Logics) and 
Information Integration. 



1 Introduction 

The possibility of re-using and composing existing small-sized software units is 
becoming an increasingly important issue in the construction of new computing 
systems. Since knowledge bases are important components of many computing 
systems, the problem of combining knowledge bases arises naturally. 

The main goal of this paper is to propose a basic set of compositional
operators that can be used to combine separate concept-based knowledge bases,
i.e., knowledge bases written in a concept description language (or
description logic). In particular, we focus our attention on the language
$\mathcal{ALN}$, extended with recursive definitions of concepts.

To the best of our knowledge, the problem of integrating concept-based
knowledge bases was first addressed in [?]. In that work, knowledge bases are
expressed in a very simple, tractable language, called $\mathcal{AL}$, which
is usually considered the core language of a family of well-studied concept
description languages [5]. This family of languages is obtained by extending
$\mathcal{AL}$ with other constructs. Since $\mathcal{ALN}$ is one of the most
powerful extensions of $\mathcal{AL}$ with tractable reasoning procedures, it
seemed to us to be a suitable candidate to be studied from the point of view
of integrating description logic theories. A closer look at the language
$\mathcal{ALN}$ has revealed some interesting problems that do not arise with
$\mathcal{AL}$.

In [H] a description logic is used to integrate information in the context of 
distributed databases and data warehousing. The authors propose a declarative 
methodology for information integration. A description logic with n-ary relations 
is used to model data from multiple sources. Thus, it is possible to build a unified 
representation of the data and exploit the reasoning techniques of the underlying 
description logic when querying the global information system. 

In this paper, we are concerned with information integration in the area of
knowledge representation. Concept-based knowledge bases are seen as logical
theories that can be merged (or composed). To this end, we propose three
basic operations for composing knowledge bases: union ($\vee$), intersection
($\wedge$) and renaming ($RN$). Let $\Sigma_1$ and $\Sigma_2$ be two knowledge
bases. $\Sigma_1 \vee \Sigma_2$ represents a knowledge base containing the
logical consequences of $\Sigma_1$ or $\Sigma_2$. The intersection,
$\Sigma_1 \wedge \Sigma_2$, captures the set of common logical consequences of
$\Sigma_1$ and $\Sigma_2$. Finally, renaming is a binary operator,
$RN(\Sigma_1, \rho)$, where $\rho$ is a renaming function, that replaces names
of relations or concepts in $\Sigma_1$ by some other ones according to $\rho$.

It is important to mention that compositional operators have also been
investigated in the context of logic programming [?], for combining not only
definite programs [7], but also general logic programs [3].

We characterize the compositional operators in two distinct ways. Firstly,
we present a semantics that abstracts away from implementation details.
Secondly, we define a syntactic transformation $\tau$, from knowledge base
expressions to knowledge bases, for representing the operational behaviour.
An important aspect is that the intersection of two $\mathcal{ALN}$-knowledge
bases may have an infinite number of logical consequences that can only be
captured in a finite way by concepts defined recursively. Although the
intersection of two knowledge bases expressed in $\mathcal{AL}$ may also have
an infinite number of logical consequences, there is no need for recursive
definitions (as shown in [?]).

It is worth noting that we make use of notions such as “least common
subsumer” and “most specific concept” (presented in [?]) to define the
operational semantics of the compositional operators, whereas the operational
semantics defined in [?], for knowledge bases expressed in $\mathcal{AL}$,
relies on a completely different technique: completion rules.

The rest of the paper is organized as follows. Section 2 is an introduction:
Sections 2.1 and 2.2 are concerned with the syntax and the semantics of
$\mathcal{ALN}$, respectively; Section 2.3 briefly introduces the notions of
least common subsumer and most specific concept (see [?] for a full
description). Section 3 is devoted to the declarative semantics of the
compositional operations, and several examples are included to motivate the
need for and usefulness of the proposed operators. Next, Section 4 is
concerned with the syntactic transformation $\tau$, from knowledge base
expressions to knowledge bases. Finally, Section 5 contains some concluding
remarks and possible future extensions.
2 The Language $\mathcal{ALN}$: Basic Definitions

2.1 Syntax 

The language used in this work is $\mathcal{ALN}$ [5] extended with cyclic
concept descriptions.

Let (Atom, Def, Rel, Ind) be a tuple of pairwise disjoint sets of atomic
concepts, Atom, defined concepts, Def, binary relations, Rel, and individuals,
Ind. In the sequel, we assume that $A \in$ Atom, $D \in$ Def, $R \in$ Rel, and
$i \in$ Ind.

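An $\mathcal{ALN}$-terminological axiom $\chi$ is defined by the following
grammar:

$$\chi \rightarrow D \doteq C_t$$

$$C_t \rightarrow \top \mid \bot \mid D \mid A \mid \neg A \mid C_{t1} \sqcap C_{t2} \mid \forall R.C_t \mid (\leq n_1\, R) \mid (\geq n_2\, R),$$

where $n_1$ is a non-negative integer and $n_2$ is a positive integer.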

A TBox $\mathcal{T}$ is a finite and nonempty sequence of terminological
axioms such that:

- each defined concept $D$ appears at most once on the left-hand side of a
  terminological axiom (i.e., each definition is unique); and

- every defined concept that occurs on the right-hand side of a terminological
  axiom $\chi_1 \in \mathcal{T}$ also occurs on the left-hand side of a
  terminological axiom $\chi_2 \in \mathcal{T}$.

A concept $D$ is defined in a TBox $\mathcal{T}$ iff it occurs on the
left-hand side of a terminological axiom $\chi \in \mathcal{T}$.

An $\mathcal{ALN}$-concept $C$ and an $\mathcal{ALN}$-assertion $\beta$ are
defined by the following grammar:

$$C \rightarrow \top \mid \bot \mid (D, \mathcal{T}) \mid A \mid \neg A \mid C_1 \sqcap C_2 \mid \forall R.C \mid (\leq n_1\, R) \mid (\geq n_2\, R)$$

$$\beta \rightarrow i : C \mid R(i_1, i_2),$$

where $n_1$ is a non-negative integer, $n_2$ is a positive integer, and $D$ is
a concept defined in the TBox $\mathcal{T}$. Concepts of the form
$(\leq n_1\, R)$ or $(\geq n_2\, R)$ are usually called numeric restrictions.

An $\mathcal{ALN}$-knowledge base $\Sigma$ is a finite set of
$\mathcal{ALN}$-assertions.

Notice that we do not preclude the existence of cyclic terminological axioms
and, thus, of recursively defined concepts. Moreover, note that each defined
concept $D$ occurring in a knowledge base $\Sigma$ is coupled to its
definition, i.e., is associated with a TBox $\mathcal{T}$ in which it is
defined. Thus, these definitions have a “local flavour”.

2.2 Semantics 

The semantics of description logic theories is defined model-theoretically: a
concept represents a set of elements of a given domain, and a relation denotes
a set of pairs of elements of that domain. Intuitively, a concept may denote:
the domain of discourse ($\top$); the empty set ($\bot$); the set of elements
that belong to an atomic concept ($A$) or to its complement ($\neg A$); the
intersection of two concepts ($C_1 \sqcap C_2$); the set of elements whose
image by a relation is contained in a concept ($\forall R.C$); or the set of
elements whose image by a relation contains at most $n_1$ elements
($(\leq n_1\, R)$), or at least $n_2$ elements ($(\geq n_2\, R)$).

An assertion $\beta$ states that an individual belongs to a concept
($i : C$), or that two individuals are related by some relation
($R(i_1, i_2)$). The meaning of a terminological axiom $D \doteq C$ is that
concepts $D$ and $C$ represent the same set.



Example 1 Let $\Sigma_{fam}$ be a knowledge base, in which son, daughter, and
child are relations; mother is a defined concept; woman is an atomic concept;
and Joe, Fred, Ann, and Kate are individuals.

$\Sigma_{fam}$:
    son(Joe, Fred)                                     Fred is Joe’s son.
    daughter(Fred, Ann)                                Ann is Fred’s daughter.
    daughter(Ann, Kate)                                Kate is Ann’s daughter.
    Fred : $\forall$child.(mother, $\mathcal{T}_m$)    All children of Fred are mothers.

where $\mathcal{T}_m = \langle \mathsf{mother} \doteq \mathsf{woman} \sqcap (\geq 1\ \mathsf{child}) \rangle$,
which defines mother as a woman with children.



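To simplify the remainder of this section, we assume that (Atom, Def, Rel,
Ind) is a fixed alphabet. An interpretation of $\mathcal{ALN}$ is a pair
$\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$, where
$\Delta^{\mathcal{I}}$ is a nonempty set, called the interpretation domain,
and $\cdot^{\mathcal{I}}$ is the interpretation function that satisfies the
following conditions:

- $i^{\mathcal{I}} \in \Delta^{\mathcal{I}}$, for every individual $i \in$ Ind;

- $i_1 \neq i_2 \Rightarrow i_1^{\mathcal{I}} \neq i_2^{\mathcal{I}}$, for all individuals $i_1, i_2 \in$ Ind;

- $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$, for every atomic concept $A \in$ Atom; and

- $R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$, for every relation $R \in$ Rel.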



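The interpretation function is extended to arbitrary concepts, except the
defined ones, as follows, where
$R^{\mathcal{I}}(a) = \{b \in \Delta^{\mathcal{I}} \mid (a,b) \in R^{\mathcal{I}}\}$,
and $\#X$ stands for the cardinality of the set $X$:

$$\top^{\mathcal{I}} = \Delta^{\mathcal{I}}$$
$$\bot^{\mathcal{I}} = \emptyset$$
$$(\neg A)^{\mathcal{I}} = \Delta^{\mathcal{I}} \setminus A^{\mathcal{I}}$$
$$(C_1 \sqcap C_2)^{\mathcal{I}} = C_1^{\mathcal{I}} \cap C_2^{\mathcal{I}}$$
$$(\forall R.C)^{\mathcal{I}} = \{a \in \Delta^{\mathcal{I}} \mid R^{\mathcal{I}}(a) \subseteq C^{\mathcal{I}}\}$$
$$(\leq n_1\, R)^{\mathcal{I}} = \{a \in \Delta^{\mathcal{I}} \mid \#R^{\mathcal{I}}(a) \leq n_1\}$$
$$(\geq n_2\, R)^{\mathcal{I}} = \{a \in \Delta^{\mathcal{I}} \mid \#R^{\mathcal{I}}(a) \geq n_2\}.$$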
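
The clauses above translate almost line by line into a recursive evaluator over a finite interpretation. The sketch below is only an illustration: the tuple encoding of concepts, the function names, and the choice of which individuals are women are assumptions made here, not part of the paper.

```python
def successors(role_ext, a):
    """R^I(a): all b such that (a, b) is in the extension of the relation."""
    return {b for (x, b) in role_ext if x == a}


def interpret(concept, domain, atoms, roles):
    """Extension of a concept without defined names in a finite interpretation.

    Concepts are encoded as tuples (an assumed encoding):
      ("top",), ("bottom",), ("atom", A), ("not", A), ("and", C1, C2),
      ("all", R, C), ("atmost", n, R), ("atleast", n, R)
    """
    kind = concept[0]
    if kind == "top":
        return set(domain)
    if kind == "bottom":
        return set()
    if kind == "atom":
        return set(atoms[concept[1]])
    if kind == "not":  # negation applies to atomic concepts only
        return set(domain) - set(atoms[concept[1]])
    if kind == "and":
        return (interpret(concept[1], domain, atoms, roles)
                & interpret(concept[2], domain, atoms, roles))
    if kind == "all":
        filler = interpret(concept[2], domain, atoms, roles)
        return {a for a in domain if successors(roles[concept[1]], a) <= filler}
    if kind == "atmost":
        return {a for a in domain if len(successors(roles[concept[2]], a)) <= concept[1]}
    if kind == "atleast":
        return {a for a in domain if len(successors(roles[concept[2]], a)) >= concept[1]}
    raise ValueError(f"unknown constructor: {kind}")


# A small interpretation in the spirit of Example 1. Which individuals are
# women is an assumption made for this illustration.
domain = {"Joe", "Fred", "Ann", "Kate"}
atoms = {"woman": {"Ann", "Kate"}}
roles = {"child": {("Joe", "Fred"), ("Fred", "Ann"), ("Ann", "Kate")}}

mother_body = ("and", ("atom", "woman"), ("atleast", 1, "child"))
all_children_mothers = ("all", "child", mother_body)

if __name__ == "__main__":
    print(interpret(mother_body, domain, atoms, roles))
    print(interpret(all_children_mothers, domain, atoms, roles))
```

Note that an element with no successors satisfies every value restriction vacuously, which is why Kate belongs to the second extension.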



There are different kinds of semantics in the presence of recursive
terminologies (cf. [?]). As is usual in cases similar to ours, we shall adopt
the greatest fixed point semantics, since it is the one that best captures
the meaning of the recursive definitions.
The intuitive idea behind this semantics is to see a TBox

$$\mathcal{T} = \langle D_1 \doteq C_1, D_2 \doteq C_2, \ldots, D_n \doteq C_n \rangle$$

as a system of equations, defining the concepts $D_1, D_2, \ldots, D_n$. Let
$\mathcal{I}$ be an interpretation. We can look at the right-hand side of each
terminological axiom, $C_k$, as a function $\mathcal{T}_{\mathcal{I},k}$ that,
given $n$ sets $X_1, X_2, \ldots, X_n$, acting as if they were
$D_1^{\mathcal{I}}, D_2^{\mathcal{I}}, \ldots, D_n^{\mathcal{I}}$, returns the
set $C_k^{\mathcal{I}}$. Note that if we apply simultaneously the same input
$X_1, X_2, \ldots, X_n$ to the $n$ functions
$\mathcal{T}_{\mathcal{I},1}, \mathcal{T}_{\mathcal{I},2}, \ldots, \mathcal{T}_{\mathcal{I},n}$,
we obtain $n$ new sets $C_1^{\mathcal{I}}, \ldots, C_n^{\mathcal{I}}$ in the
output. Therefore, $\mathcal{T}$ and $\mathcal{I}$ may also be seen as
characterizing a function $\mathcal{T}_{\mathcal{I}}$ whose fixed points are
the solutions of the system of equations. Roughly speaking, the interpretation
of $D_1, D_2, \ldots, D_n$ we are interested in is given by the greatest fixed
point of this function $\mathcal{T}_{\mathcal{I}}$. Now we will formalize
these notions.

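Let $\mathcal{T} = \langle D_1 \doteq C_1, \ldots, D_n \doteq C_n \rangle$ be
a TBox and $\mathcal{I}$ be an interpretation. For every $n$-tuple
$x = (X_1, \ldots, X_n)$ of subsets of $\Delta^{\mathcal{I}}$, let
$\mathcal{I}[x]$ denote the interpretation $\mathcal{I}$ that associates the
set $X_k$ with the defined concept $D_k$, i.e.,
$D_k^{\mathcal{I}[x]} = X_k$, for every $k = 1, \ldots, n$. Then, the function

$$\mathcal{T}_{\mathcal{I}} : (\wp(\Delta^{\mathcal{I}}))^n \rightarrow (\wp(\Delta^{\mathcal{I}}))^n, \qquad \mathcal{T}_{\mathcal{I}}(x) = (C_1^{\mathcal{I}[x]}, \ldots, C_n^{\mathcal{I}[x]})$$

is called the function defined by $\mathcal{T}$ and $\mathcal{I}$.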

We may ask if, for every TBox $\mathcal{T}$ and every interpretation
$\mathcal{I}$, the function defined by $\mathcal{T}$ and $\mathcal{I}$ has a
greatest fixed point. Notice that the answer is yes if the functions
$\mathcal{T}_{\mathcal{I}}$, defined on complete lattices, are monotonic. Of
course this property depends on the underlying concept language but, in our
case, the result achieved by Nebel ([11]) for the language NTT allows us to
conclude that monotonicity holds. The next definition makes use of this fact.

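Let $\mathcal{I}$ be an interpretation and $D$ be a concept defined in a TBox
$\mathcal{T}$. The interpretation $\mathcal{I}$ is extended to $D$ (in the
context of $\mathcal{T}$) and to $(D, \mathcal{T})$ (in the context of any
assertion) as follows:

$$D^{\mathcal{I}} = (D, \mathcal{T})^{\mathcal{I}} = v_k,$$

where $v = (v_1, \ldots, v_n)$ is the greatest fixed point of the function
defined by $\mathcal{T}$ and $\mathcal{I}$, and $D$ is the $k$-th concept
defined in $\mathcal{T}$.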
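
On a finite domain, the greatest fixed point can be reached by starting from the largest candidate (every defined concept denotes the whole domain) and applying the function defined by the TBox until nothing changes; monotonicity guarantees that the sequence only shrinks. The sketch below illustrates this under stated assumptions: the tuple encoding, the `("def", name)` constructor, and the function names are hypothetical, and the procedure only applies to finite interpretations.

```python
def successors(roles, role, a):
    """R^I(a) for the named relation."""
    return {b for (x, b) in roles[role] if x == a}


def extension(concept, domain, atoms, roles, assign):
    """Extension of a concept; defined names are looked up in `assign`."""
    kind = concept[0]
    if kind == "top":
        return set(domain)
    if kind == "bottom":
        return set()
    if kind == "atom":
        return set(atoms[concept[1]])
    if kind == "not":
        return set(domain) - set(atoms[concept[1]])
    if kind == "def":  # a defined concept, interpreted by the current guess
        return set(assign[concept[1]])
    if kind == "and":
        return (extension(concept[1], domain, atoms, roles, assign)
                & extension(concept[2], domain, atoms, roles, assign))
    if kind == "all":
        filler = extension(concept[2], domain, atoms, roles, assign)
        return {a for a in domain if successors(roles, concept[1], a) <= filler}
    if kind == "atmost":
        return {a for a in domain if len(successors(roles, concept[2], a)) <= concept[1]}
    if kind == "atleast":
        return {a for a in domain if len(successors(roles, concept[2], a)) >= concept[1]}
    raise ValueError(f"unknown constructor: {kind}")


def fixed_point(tbox, domain, atoms, roles, greatest=True):
    """Iterate the function defined by the TBox and the interpretation.

    Starting from the full domain yields the greatest fixed point; starting
    from the empty set yields the least one (finite domains only).
    """
    assign = {name: (set(domain) if greatest else set()) for name in tbox}
    while True:
        new = {name: extension(body, domain, atoms, roles, assign)
               for name, body in tbox.items()}
        if new == assign:
            return assign
        assign = new


# C = (<= 1 R) and (>= 1 R) and forall R.C, over a domain where i is related
# to itself by R and j has no R-successor.
tbox = {"C": ("and", ("atmost", 1, "R"),
              ("and", ("atleast", 1, "R"), ("all", "R", ("def", "C"))))}
domain = {"i", "j"}
roles = {"R": {("i", "i")}}

if __name__ == "__main__":
    print(fixed_point(tbox, domain, {}, roles, greatest=True))
    print(fixed_point(tbox, domain, {}, roles, greatest=False))
```

The two results differ: under the greatest fixed point the self-related element belongs to C, whereas under the least fixed point C is empty, which illustrates why the choice of semantics matters for cyclic definitions.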

An interpretation $\mathcal{I}$ satisfies an assertion $i : C$ iff
$i^{\mathcal{I}} \in C^{\mathcal{I}}$; and satisfies $R(i_1, i_2)$ iff
$(i_1^{\mathcal{I}}, i_2^{\mathcal{I}}) \in R^{\mathcal{I}}$. A knowledge base
$\Sigma$ is satisfiable if there is an interpretation $\mathcal{I}$ that
satisfies all of its assertions. In this case, $\mathcal{I}$ is said to be a
model of $\Sigma$. When $\Sigma$ is not satisfiable, we say that $\Sigma$ is
unsatisfiable. Two knowledge bases are said to be equivalent if and only if
they have the same models.

Furthermore, an assertion $\beta$ is a logical consequence of a knowledge base
$\Sigma$, written $\Sigma \models \beta$, iff every model of $\Sigma$
satisfies $\beta$. Notice that any assertion is a logical consequence of any
unsatisfiable knowledge base and that equivalent knowledge bases possess
exactly the same logical consequences.
2.3 Least Common Subsumer and Most Specific Concept

Let $C_1$, $C_2$, and $C_3$ be $\mathcal{ALN}$-concepts. We say that $C_3$
subsumes $C_1$, which is represented by $C_1 \sqsubseteq C_3$, iff
$C_1^{\mathcal{I}} \subseteq C_3^{\mathcal{I}}$ for all interpretations
$\mathcal{I}$. Moreover, $C_3$ is a least common subsumer of $C_1$ and $C_2$,
written $lcs(C_1, C_2)$, iff:

(i) $C_1 \sqsubseteq C_3$ and $C_2 \sqsubseteq C_3$;

(ii) $C_3$ is a least $\mathcal{ALN}$-concept satisfying (i), i.e., if $C_3'$
is an $\mathcal{ALN}$-concept such that $C_1 \sqsubseteq C_3'$ and
$C_2 \sqsubseteq C_3'$, then $C_3 \sqsubseteq C_3'$.

Let $\Sigma$ be an $\mathcal{ALN}$-knowledge base, $C$ be an
$\mathcal{ALN}$-concept, and $i$ be an individual. $C$ is said to be a most
specific concept for $i$ w.r.t. $\Sigma$, which is denoted by
$msc_{\Sigma}(i)$, iff:

(i) $\Sigma \models i : C$; and

(ii) $C$ is a least concept satisfying (i), i.e., if $C'$ is an
$\mathcal{ALN}$-concept such that $\Sigma \models i : C'$, then
$C \sqsubseteq C'$.



Example 2 Let $\Sigma = \{R(i, i),\ i : (\leq 1\, R)\}$. It is easy to see
that $i$ is an instance of any concept of the form
$\forall R. \cdots .\forall R.((\leq 1\, R) \sqcap (\geq 1\, R))$. In order to
represent all these concepts in a finite way, we need to rely on a recursive
definition, such as
$C \doteq (\leq 1\, R) \sqcap (\geq 1\, R) \sqcap \forall R.C$. Thus,

$$msc_{\Sigma}(i) = (C, \langle C \doteq (\leq 1\, R) \sqcap (\geq 1\, R) \sqcap \forall R.C \rangle).$$

It turns out that, without recursive definitions, $msc_{\Sigma}(i)$ could not
have been expressed in $\mathcal{ALN}$. ◇

In [?], it is shown that there is always a least common subsumer of two
$\mathcal{ALN}$-concepts and a most specific concept for an individual w.r.t.
an $\mathcal{ALN}$-knowledge base. Both entities can be computed in
exponential time by building a finite automaton.

3 Combining Knowledge Bases: Three Basic Operations 

To begin with, let us present the declarative semantics of the compositional
operators: union ($\vee$), intersection ($\wedge$), and renaming ($RN$).

So far, the letter $\Sigma$ has always stood for a set of assertions forming a
knowledge base. Nevertheless, in the sequel, we shall also use $\Sigma$ to
denote the name of a knowledge base.

Besides, a renaming $\rho$ is a total function from Atom $\cup$ Rel to a set
of atomic concepts and relations, Atom' $\cup$ Rel', such that:

- Atom', Def, Rel', and Ind are pairwise disjoint
  (thus (Atom', Def, Rel', Ind) is a valid alphabet); and

- Atom $\cap$ Rel' = Atom' $\cap$ Rel = $\emptyset$
  (which means that changes in the word type are not permitted).
Starting from a collection of $\mathcal{ALN}$-knowledge bases $\Sigma$ and a
collection of renamings $\rho$, we define the language of
$\mathcal{ALN}$-knowledge base expressions $E$ as follows:

$$E \rightarrow \Sigma \mid RN(E, \rho) \mid E_1 \vee E_2 \mid E_1 \wedge E_2.$$

Intuitively, the effect of renaming a knowledge base $\Sigma$ via a renaming
$\rho$ (i.e., $RN(\Sigma, \rho)$) is that every occurrence of an atomic
concept $A \in$ Atom (resp., relation $R \in$ Rel) in $\Sigma$ is replaced by
$\rho(A)$ (resp., $\rho(R)$). Obviously, the renaming has no practical effect
whenever $\rho$ is the identity function.

Defined concepts are not worth renaming. Recall that a TBox

$$\mathcal{T} = \langle D_1 \doteq C_1, D_2 \doteq C_2, \ldots, D_n \doteq C_n \rangle$$

acts as an independent system of equations, defining $D_1, D_2, \ldots, D_n$
(which are all the defined concepts that occur in $\mathcal{T}$), and that the
interpretation of a concept $D$ only depends on the TBox to which it is
coupled ($(D, \mathcal{T})$). Furthermore, concepts defined in the same TBox
cannot be replaced by the same name because, by definition of TBox,
definitions are unique. Thus, renaming a defined concept would not change its
meaning, nor would it change the meaning of any other concept occurring in
the knowledge base.

We write $\rho = [d_1/d_1', \ldots, d_n/d_n']$ to make explicit that

- $\rho(d_i) = d_i'$, for all $i = 1, \ldots, n$, and

- $\rho(d) = d$, for every $d \notin \{d_1, \ldots, d_n\}$,

where $n \geq 1$, $\{d_1, \ldots, d_n\} \subseteq$ Atom $\cup$ Rel, and
$\{d_1', \ldots, d_n'\} \subseteq$ Atom' $\cup$ Rel'. Renamings are extended
to arbitrary concepts, assertions, terminological axioms, and knowledge bases
in the usual way.
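
To make the effect of a renaming concrete, here is a minimal sketch that applies a renaming to a set of assertions. It is deliberately restricted to relation assertions and assertions about atomic concepts; the tuple encoding, the function name `rename`, and the use of a dictionary for the renaming are assumptions of this illustration.

```python
def rename(kb, rho):
    """Apply a renaming to every relation name and atomic concept name.

    Assertions are encoded as tuples (an assumed encoding):
      ("role", R, i1, i2)   for R(i1, i2)
      ("inst", i, A)        for i : A, with A an atomic concept
    Names not mentioned in `rho` are left unchanged, as in the definition.
    """
    renamed = set()
    for assertion in kb:
        if assertion[0] == "role":
            _, relation, first, second = assertion
            renamed.add(("role", rho.get(relation, relation), first, second))
        elif assertion[0] == "inst":
            _, individual, atom = assertion
            renamed.add(("inst", individual, rho.get(atom, atom)))
        else:
            raise ValueError(f"unsupported assertion: {assertion!r}")
    return renamed


# The relation assertions of the family knowledge base of Example 1.
sigma_fam = {
    ("role", "son", "Joe", "Fred"),
    ("role", "daughter", "Fred", "Ann"),
    ("role", "daughter", "Ann", "Kate"),
}

# A renaming that maps both son and daughter to child.
rho = {"son": "child", "daughter": "child"}

if __name__ == "__main__":
    for assertion in sorted(rename(sigma_fam, rho)):
        print(assertion)
```

Because two different names may be mapped to the same target, a renaming can merge vocabulary; the original names no longer appear in the result.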

The next step is to provide a compositional declarative semantics for
knowledge base expressions. An $\mathcal{ALN}$-knowledge base expression $E$
is characterized in terms of a set of $\mathcal{ALN}$-assertions, by induction
on the structure of $E$:

$$RN(E, \rho) = \rho(\{\beta \mid E \models \beta\})$$
$$E_1 \vee E_2 = \{\beta \mid E_1 \models \beta \text{ or } E_2 \models \beta\}$$
$$E_1 \wedge E_2 = \{\beta \mid E_1 \models \beta \text{ and } E_2 \models \beta\}.$$

Notice that, according to the semantics given before, an
$\mathcal{ALN}$-knowledge base $\Sigma$ is equivalent to the set of its
logical consequences, $\{\beta \mid \Sigma \models \beta\}$.

At first sight, it seems that this definition can be easily rephrased in terms
of models. If $\mathcal{M}(\Sigma)$ denotes the set of models of an
$\mathcal{ALN}$-knowledge base $\Sigma$, one could be led to believe, in
particular, that
$\mathcal{M}(\Sigma_1 \vee \Sigma_2) = \mathcal{M}(\Sigma_1) \cap \mathcal{M}(\Sigma_2)$
and
$\mathcal{M}(\Sigma_1 \wedge \Sigma_2) = \mathcal{M}(\Sigma_1) \cup \mathcal{M}(\Sigma_2)$.
Nevertheless, as the next example shows, this is not the case.

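Example 3 Let $\Sigma_1 = \{R(i_1, i_2)\}$ and $\Sigma_2 = \{R(i_1, i_3)\}$.
The set of common logical consequences of $\Sigma_1$ and $\Sigma_2$ is
$\{i_1 : (\geq 1\, R)\}$. So, in our definition, the set of models associated
with $\Sigma_1 \wedge \Sigma_2$ will be
$\mathcal{M}(\{i_1 : (\geq 1\, R)\})$, to which the interpretation $M$ such
that $\Delta^{M} = \{i_1^{M}, i_2^{M}, i_3^{M}, a\}$ and
$R^{M} = \{(i_1^{M}, a)\}$ belongs. But $M \notin \mathcal{M}(\Sigma_1)$ and
$M \notin \mathcal{M}(\Sigma_2)$, which implies
$M \notin \mathcal{M}(\Sigma_1) \cup \mathcal{M}(\Sigma_2)$. ◇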
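
The counterexample can be checked mechanically. The sketch below assumes one concrete choice for the interpretation, in which each individual denotes itself, the domain has one extra element a, and the only pair in the relation links the first individual to a; this choice and the helper names are assumptions made for illustration.

```python
# One interpretation consistent with the example (assumed for illustration):
# every individual denotes itself and R relates i1 only to the extra element a.
domain = {"i1", "i2", "i3", "a"}
R = {("i1", "a")}


def holds_role(r_ext, first, second):
    """Does the interpretation satisfy R(first, second)?"""
    return (first, second) in r_ext


def holds_atleast(r_ext, individual, n):
    """Does the interpretation satisfy individual : (>= n R)?"""
    return len({b for (x, b) in r_ext if x == individual}) >= n


is_model_of_sigma1 = holds_role(R, "i1", "i2")       # Sigma_1 = {R(i1, i2)}
is_model_of_sigma2 = holds_role(R, "i1", "i3")       # Sigma_2 = {R(i1, i3)}
is_model_of_common = holds_atleast(R, "i1", 1)       # {i1 : (>= 1 R)}

if __name__ == "__main__":
    print(is_model_of_sigma1, is_model_of_sigma2, is_model_of_common)
```

The interpretation satisfies the common consequence but neither of the two original knowledge bases, so the models of the intersection are not simply the union of the two sets of models.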







Aida Vitoria and Margarida Mamede 



Example 4 Consider a knowledge base, $\Sigma_{disease}$, that contains
information on people with diseases. An excerpt of $\Sigma_{disease}$ is:

$\Sigma_{disease}$:
    Joe : (diseaseA, $\mathcal{T}_d$)
    Tom : (diseaseA, $\mathcal{T}_d$)

where

$$\mathcal{T}_d = \langle \mathsf{diseaseA} \doteq \mathsf{abnA} \sqcap \forall \mathsf{child}.(\mathsf{healthy} \sqcap \forall \mathsf{child}.\mathsf{diseaseA}) \rangle.$$

To some extent, someone is infected with diseaseA if he has abnormality
abnA, all his children are healthy, but all grandchildren suffer from
diseaseA.

It could be desirable to combine knowledge bases $\Sigma_{fam}$ and
$\Sigma_{disease}$, in order to get information about other people affected
with diseases. This could be achieved, for instance, in the following way:

$$RN(\Sigma_{fam}, \rho) \vee \Sigma_{disease},$$

where $\rho$ = [son/child, daughter/child]. Then, we would conclude that Fred
and Kate are healthy, whereas Ann is infected with diseaseA and has the
abnormality abnA. Note that family relations, like son and daughter, would
be destroyed (e.g.,
$RN(\Sigma_{fam}, \rho) \vee \Sigma_{disease} \not\models$ son(Joe, Fred)). ◇

The previous example shows that the renaming operator can be used for 
two distinct purposes: to eliminate vocabulary differences from independently 
developed knowledge bases, or to implement some form of information protection 
strategies. 

Example 5 Let $\Sigma_{bus}$ and $\Sigma_{train}$ be two knowledge bases that
contain information about cities directly connected by bus and by train,
respectively. An excerpt of $\Sigma_{bus}$ could be:

    bus(Montijo, Lisboa)
    bus(Lisboa, Setúbal)

If we want to know whether a city B can be reached from a city A, by some
means of transport, then we only need to check if the assertion
B : reachable is a logical consequence of the following expression:

$$RN(\Sigma_{bus}, [\mathsf{bus}/\mathsf{link}]) \vee RN(\Sigma_{train}, [\mathsf{train}/\mathsf{link}]) \vee \Sigma_{fromA},$$

where

$$\Sigma_{fromA} = \{A : (\mathsf{rea}, \langle \mathsf{rea} \doteq \mathsf{reachable} \sqcap \forall \mathsf{link}.\mathsf{rea} \rangle)\}.$$

Now, suppose that, given any two cities, A and B, we would like to know
whether there are two types of connections from A to B: one only by bus and
the other only by train. We could retrieve the desired information from:

$$(RN(\Sigma_{bus}, [\mathsf{bus}/\mathsf{link}]) \vee \Sigma_{fromA}) \wedge (RN(\Sigma_{train}, [\mathsf{train}/\mathsf{link}]) \vee \Sigma_{fromA}).$$
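
For knowledge bases of this shape, the recursive concept rea propagates along every asserted link, so the individuals entailed to be reachable are those found by a graph traversal from A. The following sketch illustrates the two queries under that reading. It is not the transformation defined in Section 4: the function name is illustrative, and the train connections are hypothetical, since the example gives no excerpt for the train knowledge base.

```python
def reachable_from(start, links):
    """Individuals that must be `reachable` when `start` is asserted to be rea.

    `start` itself is included, because rea is defined as a conjunction that
    contains `reachable`. Every asserted link successor of a rea individual
    is again a rea individual.
    """
    seen = {start}
    frontier = [start]
    while frontier:
        city = frontier.pop()
        for (source, target) in links:
            if source == city and target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen


# Bus connections from the example, already renamed to `link`.
bus_links = {("Montijo", "Lisboa"), ("Lisboa", "Setúbal")}
# Hypothetical train connections (not given in the example).
train_links = {("Montijo", "Lisboa")}

start_city = "Montijo"

# Union expression: any means of transport may be used on each leg.
by_any_means = reachable_from(start_city, bus_links | train_links)

# Intersection expression: reachable by bus only and also by train only.
by_both_means = (reachable_from(start_city, bus_links)
                 & reachable_from(start_city, train_links))

if __name__ == "__main__":
    print(sorted(by_any_means))
    print(sorted(by_both_means))
```

With the hypothetical train data, Setúbal is reachable by some means of transport but does not have both a bus-only and a train-only connection.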
4 Operational Semantics

In this section, we discuss a possible implementation of the proposed
operations. We define a syntactic transformation $\tau$ that, given a
knowledge base expression, produces an $\mathcal{ALN}$-knowledge base. Another
alternative (not explored in this paper) is to implement the language of
$\mathcal{ALN}$-knowledge base expressions without building any new concrete
knowledge base.

The definition of $\tau$ relies on the following notion.

$$\Sigma_1 \nabla \Sigma_2 = \{R(i_1, i_2) \mid R(i_1, i_2) \in \Sigma_1 \cap \Sigma_2\} \cup \{i : C \mid i \text{ occurs in } \Sigma_1 \text{ and } \Sigma_2,\ C = lcs(msc_{\Sigma_1}(i), msc_{\Sigma_2}(i))\}.$$

It is easy to prove that $\Sigma_1 \models \beta$ and
$\Sigma_2 \models \beta$, for every assertion
$\beta \in \Sigma_1 \nabla \Sigma_2$.

The transformation $\tau$ can now be defined, by induction on the structure of
the knowledge base expressions:

$$\tau(\Sigma) = \Sigma$$
$$\tau(RN(E, \rho)) = \rho(\tau(E))$$
$$\tau(E_1 \vee E_2) = \tau(E_1) \cup \tau(E_2)$$
$$\tau(E_1 \wedge E_2) = \begin{cases}
\tau(E_1) \nabla \tau(E_2) & \text{if } \tau(E_1) \text{ and } \tau(E_2) \text{ are both satisfiable,} \\
\tau(E_1) & \text{if } \tau(E_1) \text{ is satisfiable and } \tau(E_2) \text{ is not,} \\
\tau(E_2) & \text{if } \tau(E_2) \text{ is satisfiable and } \tau(E_1) \text{ is not,} \\
\{i : \bot\} & \text{otherwise.}
\end{cases}$$

The definition of $\tau(E_1 \wedge E_2)$ requires four different cases. When
$\tau(E_1)$ is unsatisfiable, every assertion is a logical consequence of
$E_1$. Consequently, the logical consequences of $E_2$ are the common logical
consequences of $E_1$ and $E_2$, and $\tau(E_2)$ may be returned. If neither
$\tau(E_1)$ nor $\tau(E_2)$ is satisfiable, then every assertion is a logical
consequence of $E_1$ and $E_2$. So, every assertion should also be a logical
consequence of $\tau(E_1 \wedge E_2)$. This is captured by any knowledge base
containing an explicit contradiction, i.e., an assertion of the form
$i : \bot$.

The next proposition states the correctness of the proposed implementation
and its proof is presented in [?].

Proposition 1. Let $E$ be an $\mathcal{ALN}$-knowledge base expression and
$\beta$ be an $\mathcal{ALN}$-assertion. Then,
$E \models \beta \Leftrightarrow \tau(E) \models \beta$.
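
The structural recursion behind the transformation can be sketched as follows. This is only an outline under stated assumptions: the satisfiability test and the combination of two satisfiable knowledge bases (which in the paper relies on least common subsumers and most specific concepts) are not implemented here but passed in as parameters, and the tuple encoding of expressions is hypothetical.

```python
def tau(expr, is_satisfiable, combine, contradiction):
    """Transform a knowledge base expression into a set of assertions.

    Expressions are encoded as tuples (an assumed encoding):
      ("kb", assertions), ("rn", expr, rho), ("or", e1, e2), ("and", e1, e2)
    `rho` is a function on assertions. `is_satisfiable` and `combine` are
    supplied by the caller; `contradiction` is any explicitly contradictory
    knowledge base.
    """
    kind = expr[0]
    if kind == "kb":
        return frozenset(expr[1])
    if kind == "rn":
        _, inner, rho = expr
        inner_kb = tau(inner, is_satisfiable, combine, contradiction)
        return frozenset(rho(a) for a in inner_kb)
    if kind == "or":
        return (tau(expr[1], is_satisfiable, combine, contradiction)
                | tau(expr[2], is_satisfiable, combine, contradiction))
    if kind == "and":
        left = tau(expr[1], is_satisfiable, combine, contradiction)
        right = tau(expr[2], is_satisfiable, combine, contradiction)
        left_ok, right_ok = is_satisfiable(left), is_satisfiable(right)
        if left_ok and right_ok:
            return frozenset(combine(left, right))
        if left_ok:      # right is unsatisfiable and entails everything
            return left
        if right_ok:     # left is unsatisfiable and entails everything
            return right
        return frozenset(contradiction)
    raise ValueError(f"unknown expression: {kind}")


# Stand-ins used only to exercise the recursion; they are NOT the real
# satisfiability test or the real lcs/msc-based combination.
def stub_is_satisfiable(kb):
    return "FALSE" not in kb


def stub_combine(left, right):
    return left & right


CONTRADICTION = frozenset({"i : bottom"})

if __name__ == "__main__":
    expr = ("and", ("kb", {"p", "FALSE"}), ("or", ("kb", {"q"}), ("kb", {"r"})))
    print(sorted(tau(expr, stub_is_satisfiable, stub_combine, CONTRADICTION)))
```

Passing the two operations as parameters keeps the case analysis visible without committing to any particular reasoner.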

5 Conclusions and Future Work

The purpose of the work described in this paper was to introduce three
operators (union, intersection and renaming) for composing knowledge bases
expressed in the description logic $\mathcal{ALN}$. To this end, knowledge
base expressions have been defined with a compositional semantics. A
computational interpretation of knowledge base expressions has also been
formalized through a transformation that maps knowledge base expressions into
knowledge bases. A novel aspect of
this work is that we have applied the notions of least common subsumer and 
most specific concept to the problem of integrating description logic theories. 

Concerning future work, we aim at investigating the following open problems. 

On the one hand, it is possible to extend both the underlying description logic 
and the set of compositional operators. We have already considered the possibil- 
ity of extending the language with relation compositions and relation hierarchies. 
The problem, however, is that it is not possible to give an automata-theoretic 
characterization of the least common subsumer and most specific concept for 
description logics having those operators (an example illustrating this point
was presented in [13]).

On the other hand, it is interesting to study alternative implementations. In 
this work, we have proposed a compilation oriented approach. Another possible 
implementation could be interpretation oriented so as not to need a new knowl- 
edge base to be built. The main advantage of this setting is that we could cater 
for distributed concept-based knowledge bases. 

Finally, we would like to explore the possibility of applying the results of 
Baader and his colleagues on matching in description logics 00!, in order to 
(automatically) build renamings that match concept names in arbitrary con- 
cepts, including the case of matching recursively defined concepts. 
Abstract. The accurate translation of collocations, or multi-word units,
is essential for high quality machine translation. However, many colloca- 
tions do not translate compositionally, thus requiring individual entries 
in the bilingual lexicon. We present a technique for collocation extraction 
from large corpora that takes into account the dispersion of the colloca- 
tions throughout the corpus. Collocations are ranked to more accurately 
reflect how likely they are to occur in a wide variety of texts; collocations 
which are specific to a particular text are less useful for lexicon develop- 
ment. Once the collocations are extracted, appropriate bilingual lexical 
entries can be developed by lexicographers. 

1 Introduction 

Building a bilingual lexicon for large-scale machine translation (MT) requires 
knowledge of the useful units of translation - both single word and multi-word 
entries are required. For lexicalist MT m and m, identification of multi- 
word units, or collocations, in the source language (SL) is particularly crucial 
for coverage, and consequently for translation performance. Any construction 
that is not translatable word-for-word generally requires a separate entry in the 
bilingual lexicon for accurate, non-compositional translation. 

A collocation can be loosely defined as a sequence of lexical items in speech
or text that occurs more often than would be expected by chance. Contiguous
and noncontiguous collocations can be distinguished: the former do not permit
intervening lexical items (e.g. good morning), whereas the latter allow such
intervention, possibly restricted to a particular syntactic class (e.g. in the
support verb construction take your turn the possessive adjective is not
restricted to second person).

Practically speaking, development of an MT bilingual lexicon is often subject
to two constraints. First, for certain MT applications, particularly those
embedded in hardware applications, storage is at a premium. Therefore an
optimal bilingual lexicon should not contain entries that will be rarely used
in the translation domain. Second, the development of bilingual lexica is
labour intensive,
and product delivery deadlines limit the amount of effort that can be put into 
this component. The lexicon development schedule should ideally be based on a 
decreasing function of the usefulness of potential entries, where usefulness iden- 
tifies those single words and collocations that occur frequently and recurrently 
in the translation domain. It is clear that lexical development should be guided 
by “what’s out there”. 

The goal of this paper is the automatic identification of useful collocations for 
bilingual lexical development from a noisy domain language corpus. Conventional 
approaches to collocation extraction treat a corpus as an indivisible whole - 
when in fact corpora are comprised of a large number of texts - and therefore 
do not distinguish collocations that are found in a small number of texts from 
those that are spread more evenly throughout the corpus. For time- and space- 
constrained lexicon development, the latter type of collocation is more useful. 
If completeness is not practically attainable, then informed choices have to be 
made about the entries to include. The method we describe in this paper ranks 
collocations according to their estimated usefulness for our MT application. 

We collected a corpus of 11.2 million words from our translation domain: 
television closed captions. The corpus consists of the transcribed dialogue from 
a diverse range of television programs, scattered throughout with advertising 
(although most advertisements were not closed captioned). 

Because our corpus is made up of a large number of short “texts”, it contains
words that may have a high absolute frequency, yet are only found in a single
television program. Proper names are the canonical example of this phenomenon,
which has been referred to as the “clumpiness” or “burstiness” of words ([?]).

For the purposes of building a bilingual lexicon, words like these which are 
specific to a particular text are not as useful as those with the same absolute 
frequency, yet are spread more evenly throughout the corpus. Several researchers 
(eg. Cl) have proposed that word frequencies should be adjusted to take such 
behaviour - their dispersion - into consideration. An adjusted frequency can be 
viewed as a better estimate of a word’s “true” frequency in a corpus of infinite 
size. It is evident that the same rationale can be applied to collocations: if a 
certain collocation is only found in a small number of texts in the corpus, then it 
is not as useful for a bilingual lexicon as an equally common, but better-dispersed 
collocation, and consequently its frequency should be adjusted to compensate. 
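
The idea of discounting poorly dispersed items can be illustrated with a deliberately simple adjustment: scale each raw frequency by the fraction of texts in which the item occurs. This is an assumed stand-in chosen for illustration, not the dispersion measure used in this paper, and the tiny corpus and names below are hypothetical.

```python
from collections import Counter


def adjusted_frequencies(texts):
    """Scale each raw frequency by the share of texts containing the item.

    `texts` is a list of token lists, one per text. An item concentrated in
    a single text is discounted relative to an equally frequent item that is
    spread over the whole corpus. Illustrative adjustment only.
    """
    raw = Counter()
    texts_containing = Counter()
    for tokens in texts:
        raw.update(tokens)
        texts_containing.update(set(tokens))
    number_of_texts = len(texts)
    return {item: raw[item] * texts_containing[item] / number_of_texts
            for item in raw}


# A hypothetical corpus of four short texts: "zorg" (say, a proper name) and
# "hello" both occur four times, but "zorg" appears in one text only.
corpus = [
    ["hello", "zorg", "zorg", "zorg", "zorg"],
    ["hello", "there"],
    ["hello", "again"],
    ["well", "hello"],
]

if __name__ == "__main__":
    adjusted = adjusted_frequencies(corpus)
    for item in sorted(adjusted, key=adjusted.get, reverse=True):
        print(item, adjusted[item])
```

The same function works unchanged if the tokens are replaced by candidate collocations, for example tuples of adjacent words.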

It is with this last point in mind that we developed the procedure for discov- 
ering useful collocations described in the current paper. The rest of this paper 
is laid out as follows: first, we briefly describe our translation system. Next, we 
describe the statistical techniques employed in order to discover useful colloca- 
tions, taking into account their dispersion throughout the corpus. Finally, we 
present a graphical evaluation of the results of the procedure. 
2 The Translation System

ALTo is an MT system for the multilingual translation of Colloquial English 
as found in closed caption text m and [TO]). Closed captions are transmitted 
with the vast majority of North American television programs and all major 
release videos. Currently, modules for translation from English to Spanish and 
Portuguese have been developed. 

ALTo uses a lexicalist transfer approach to MT. The main peculiarity of this 
approach is that no structural transfer is performed. Transfer is a mapping be- 
tween bags of lexical items; transfer uses a bilingual lexicon as its only repository 
of bilingual knowledge. The translation of complex, non-compositional expres- 
sions relies on the availability of multi-word bilingual lexical entries mapping 
source complex expressions onto equivalent target expressions. 

Compared with other varieties of English, Colloquial English is particularly 
rich in idiomatic expressions (e.g. cool it, be in the clear, come clean, right up my 
alley). On one hand, the lexicalist approach is attractive for translating idiomatic 
expressions because of the simplicity and the ease of implementation of bilingual 
equivalences; on the other hand, it is clear that accurate translation requires a
comprehensive bilingual lexicon. For example, verb-object combinations are one 
type of useful collocation to include in a bilingual lexicon. Compound nouns are 
another frequent type of collocation in English (eg. wake-up call), as are verb- 
particle combinations such as check out, and prepositional verbs such as rely on 
and account for. 

We have developed a process for the semi-automatic creation of bilingual 
lexical entries. The process combines manual lexicographic work with the use 
of automatic tools that reduce the lexicographers’ workload. The input to the 
process is a collection of English words or expressions for which bilingual entries 
need to be created. The output of the process is a collection of complete bilingual 
entries that can be used by the ALTo system. The lexical development process 
is comprised of three steps: 

1. The lexicographers provide translations for the input collections of English 
expressions, in terms of simple target language word sequences. 

2. The word equivalences provided by the lexicographers are fed into an appli- 
cation, which automatically provides one or more complete bilingual entries, 
in the format used by the ALTo system, for each input word equivalence. 

3. The collection of candidate bilingual entries is returned to the lexicographers 
for validation. The validated entries are added to the bilingual lexicon. 

The collocation discovery procedure described in this paper is the neces- 
sary preliminary step to the bilingual development process. Since the discovered 
collocations are evaluated by lexicographers, who select only the linguistically 
relevant ones, it is not crucial that the collocations list contain only meaningful 
collocations. Some amount of useless material is tolerated, as the overhead in 
terms of lexicographic work is not too high. Also, given the constraints on the 
bilingual development process in terms of development time and storage limita- 
tions, it is not required that all the meaningful collocations are returned. Hence, 



Collocation Discovery for Optimal Bilingual Lexicon Development 129 



the stress is on neither precision nor recall. Ideally, the collocation discovery
procedure should return all and only the most useful collocations. Practically, what
is important is to maximize the coverage provided by each collocation that ends 
up in the bilingual lexicon. 



3 Collocation Discovery 

In this section we describe our approach to the problem of how to determine 
the useful collocations to include in the bilingual lexicon. Previous research in 
automatic collocation discovery has employed statistical measures (eg. |2], [S]i 
g] and 0), all sharing the general approach of finding collocations by comparing 
the independent probabilities of each component of the collocation to their joint 
probability, but to our knowledge none have considered the corpus dispersion 
issue described in Section 1. Our approach is to first calculate the dispersion of
a collocation in order to adjust its co-occurrence frequency, and then apply the 
log-likelihood ratio statistic in order to rank candidate collocations according 
to “interestingness”. Note that a good indicator that the procedure is working 
properly is the successful retrieval and high ranking of collocations peculiar 
to commercial advertising. Because of broadcasting regulations and practice, 
commercials ended up being dispersed quite evenly throughout our corpus. 

We preferred to work with lexemes rather than surface word forms, since 
conflation of the co-occurrence counts for different inflected forms of the same 
word increases the number of data points for each type, compensating some- 
what for the small size of the corpus. In addition, we avoid redundancy when 
creating bilingual entries, since the entries themselves specify a correspondence 
between base forms. Also for lexical development considerations, we preferred to 
use lexeme-tag pairs (lexeme-with-tag combinations such as cold/ADJECTIVE) 
rather than simple lexemes as the unit of analysis. This allowed easy distinctions 
to be made between, for example, uses of the same verb in different complemen- 
tation patterns. The value of these distinctions relies heavily on tagging accuracy, 
however. The procedure we describe in this paper can in principle be used with 
any kind of corpus, tagged or untagged. 



3.1 Tagging Phase 

After some simple manual cleaning, which involved removing long sequences of 
random characters - a side-effect of the automatic caption capturing process - 
our corpus consisted of 11,565,645 tokens. Each word in the corpus was first 
assigned a part-of-speech label using a statistical tagger. We developed a tagset 
that has a close correspondence to the syntactic category specifications used in 
the English grammar module of the ALTo system and thus the English side 
of entries in the bilingual lexicon. A subset of the tags loosely correspond to 
standard part-of-speech labels such as “noun”, “adjective”, and “conjunction”. 
The remainder of the tagset is based on the Oxford Advanced Learner’s Dic- 
tionary (OALD) classification system |B], which makes finer-grained distinctions 
than conventional tagsets. For example, OALD makes fine-grained distinctions
for verbs with regard to valency and complementation. Verbal tags are classi- 
fied into about 20 separate types, distinguishing intransitive from transitive and 
ditransitive, subject control, prepositional complements, etc. This augmented 
tagset allows us to map directly from statistical analysis of the tagged corpus to 
word and category specifications in the bilingual lexicon, automatically creating 
the SL side of the bilingual lexical entries. 

3.2 Lemmatizing Phase 

Lemmatization of the tagged corpus was done next, in order to eliminate redun- 
dant entries from being generated. Inflected word forms were replaced by their 
citation form. This was accomplished by simply looking up the tagged word 
form in the lexicon. Note that the tagging phase also effectively acted as a mor- 
phological disambiguator, in that part-of-speech ambiguous words such as plan 
are resolved to either noun or verb. After tagging and lemmatizing, our corpus 
consisted of 94,016 unique lexeme-tag pairs. 

3.3 Estimating “True” Frequencies 

A raw frequency list was produced for lexeme-tag pairs. Next, the frequencies 
for each lexeme-tag pair were adjusted according to their dispersion in the cor- 
pus. Adjusted frequency can be considered as an estimate of a lexeme-tag pair’s 
“true” frequency in natural language. Dispersion is a measure of relative en- 
tropy that quantifies how evenly a lexeme-tag pair is distributed throughout the 
texts making up a corpus PQ. We operationally define a “text” as a chunk of
10,000 words.¹ Our 11M word corpus thus consists of 1,157 “texts”. Following
[1], dispersion was calculated as follows:

$$D = -\frac{\sum_{i=1}^{n} p_i \log p_i}{\log n}$$

where: $n$ = number of texts, $p_i$ = probability of a token occurring in text $i$,
which is defined as the frequency of the token in text $i$ divided by the number
of tokens in text $i$ ($p_i \log p_i = 0$ for $p_i = 0$ by continuity).
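
The dispersion score can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used for the experiments: the function name `dispersion` is hypothetical, and it assumes that the per-text values are normalized so that they sum to 1 (each text's share of the token's occurrences), which matches the definition above up to a constant factor when all texts have the same length.

```python
import math
from typing import Sequence


def dispersion(freqs: Sequence[float]) -> float:
    """Normalized entropy of a token's distribution over n texts.

    freqs[i] is the frequency of the token in text i. Assumption: the
    p_i are normalized to sum to 1, so the score lies between 0 and 1.
    """
    n = len(freqs)
    total = sum(freqs)
    if n < 2 or total == 0:
        return 0.0
    entropy = 0.0
    for f in freqs:
        if f > 0:  # p log p is taken as 0 when p is 0
            p = f / total
            entropy -= p * math.log(p)
    return entropy / math.log(n)


# A token spread evenly over four texts versus one confined to a single text.
print(dispersion([5, 5, 5, 5]))
print(dispersion([20, 0, 0, 0]))
```

Under this normalization, an evenly spread token scores 1 and a token confined to one text scores 0, which is the behaviour the adjustment relies on.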

Adjusted frequency $F_a$ of a token is calculated as:

$$F_a = F D + (1 - D) f_{min}$$

where $F$ = raw word frequency, $D$ = dispersion score.

$f_{min}$ is a weighted sum of the product of the frequency of the token in each
text $i$ and the number of tokens in that text:

$$f_{min} = \frac{1}{N} \sum_{i=1}^{n} f_i \, tokens_i$$

where $N$ = corpus size in tokens.

¹ 10,000 words is a rough estimate of the average length of a transcribed television
program. This is a parameter in the formula: using a larger text size will de-emphasize
the text-specificity of words and collocations in the corpus; consequently, their
frequencies will be adjusted less. Ideally, the corpus should be marked up with text
beginning- and end-points defined.

The utility of the adjustment method can be seen in Table 1. The dispersion
scores for three words of comparable raw frequency (creep, amendment, and
milord) reflect their text-specificity; note that the adjusted frequency for amendment
is substantially lower than its raw count. The NOUN-tagged milord is an
even better example: like barrow, it is something that would be of minimal interest
for lexicon construction, and its inclusion in the bilingual lexicon should be
deprioritized accordingly. Note that the frequency of the better-dispersed word
creep is adjusted only moderately in comparison. Cold has the highest dispersion
score of any of these examples, and its frequency is adjusted the least.



Lexeme-tag Pair     Raw Count   Dispersion   Adjusted Freq
cold/ADJECTIVE           1271       0.8791            1117
barrow/NOUN                49       0.3980              20
creep/NOUN                149       0.6749             101
amendment/NOUN            143       0.5715              82
milord/NOUN               148       0.3858              57

Table 1. Examples of Applying Dispersion to Raw Counts.
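
The figures in Table 1 can be checked against the adjustment formula with a short sketch. Two assumptions are made here that are not stated in the text: fmin is read as (1/N) times the sum over texts of f_i times the number of tokens in text i, and all 1,157 texts are taken to be the same size, in which case fmin reduces to F / n. The function names are hypothetical.

```python
N_TEXTS = 1157  # number of 10,000-word "texts" reported for the corpus

# (raw count, dispersion, adjusted frequency) as printed in Table 1
TABLE_1 = {
    "cold/ADJECTIVE": (1271, 0.8791, 1117),
    "barrow/NOUN": (49, 0.3980, 20),
    "creep/NOUN": (149, 0.6749, 101),
    "amendment/NOUN": (143, 0.5715, 82),
    "milord/NOUN": (148, 0.3858, 57),
}


def adjusted_frequency(raw: float, disp: float, f_min: float) -> float:
    """F_a = F * D + (1 - D) * f_min."""
    return raw * disp + (1.0 - disp) * f_min


def f_min_equal_texts(raw: float, n_texts: int = N_TEXTS) -> float:
    """Assumption: with equal-sized texts, (1/N) * sum(f_i * tokens_i) = F / n."""
    return raw / n_texts


for pair, (raw, disp, reported) in TABLE_1.items():
    estimate = adjusted_frequency(raw, disp, f_min_equal_texts(raw))
    print(f"{pair:16s} estimate={estimate:8.2f} reported={reported}")
```

Under these assumptions the fmin term contributes well under one count for every row, so the adjusted frequency is dominated by the product of raw count and dispersion.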



We removed all lexeme-tag pairs having an adjusted frequency of less than 5 
from further consideration, leaving 21,738 unique lexeme-tag pairs. We reasoned 
that collocations containing rare lexemes would not be useful for the bilingual 
lexicon, and removing them reduced the computation required for further
processing.²

The dispersion formula was next used to adjust co-occurrence frequencies 
in exactly the same way as was done for lexeme-tag frequencies. This allows 
a distinction to be made between frequent collocations that are confined to a 
small number of texts (and so are of minimal interest to lexical development) 
and collocations of equivalent frequency that are spread more evenly through 
the corpus. 



3.4 Two-Word Collocations

We assessed the “interestingness” of a candidate collocation using the log-likelihood
ratio statistic, which tests the independence of the counts of the collocation
components (see [5] and 0). The comparison of co-occurrence frequency with
the independent frequencies was done through the use of contingency tables. An
example contingency table is shown in Table 2 for the 2-word VERB-PARTICLE



² This filtering is consistent with [9], who found that low-frequency words were not
productive components of collocations. 



132 



Scott McDonald et al. 



collocation stick up. The variables X and Y represent events; for example they
can stand equally well for single lexeme-tag pairs or 2-word contiguous collocations.
The log-likelihood ratio was then computed to estimate the “interestingness”
of event X occurring together with event Y. Because adjusted frequencies
are used in three of the four cells, and adjusted frequencies are nearly always
smaller than raw frequencies, the value of the fourth cell (¬X ¬Y) will be larger
than the actual corpus count. This value can be considered as an estimate of the
“true” ¬X ¬Y count in an infinite corpus.





          X          ¬X
Y        32       8,653
¬Y       49  10,499,433

Table 2. A 2-by-2 Contingency Table Showing the Dependence between X=stick
and Y=up.
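
As an illustration, the statistic can be computed for the stick up table with the sketch below. It uses one common formulation of the log-likelihood ratio (twice the sum of observed times log of observed over expected); the text does not spell out the exact form used, so this choice and the function name are assumptions.

```python
import math


def log_likelihood_ratio(k11: float, k12: float, k21: float, k22: float) -> float:
    """G^2 statistic for a 2-by-2 contingency table.

    k11 = X and Y, k12 = not-X and Y, k21 = X and not-Y, k22 = neither.
    """
    observed = ((k11, k12), (k21, k22))
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = rows[0] + rows[1]
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            if o > 0:  # an empty cell contributes nothing
                expected = rows[r] * cols[c] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Cell values taken from the contingency table for X=stick, Y=up.
print(log_likelihood_ratio(32, 8653, 49, 10499433))
```

A table whose cells match the independence expectation exactly scores 0; larger values indicate stronger association, which is what the ranking relies on.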



3.5 Three-Word Collocations

We consider a 3-word collocation to consist of two events: a single lexeme-tag
pair (a unigram) and a 2-word contiguous collocation (a bigram). This definition
allows several configurations, depending on the order of the events and
whether the second event is associated to the first within a window of words.
The possible collocation patterns are unigram+bigram, bigram+unigram,
unigram...bigram and bigram...unigram, where the latter two patterns describe
co-occurrence within a window. The log-likelihood ratio was used to rank the
interestingness of 3-word collocations in exactly the same way as was done for 
2-word collocations. Retrieving 4-word collocations turned out to be unneces- 
sary, as we observed that nearly every 4-word collocation in our corpus turned 
out to be better described as a 3-word noncontiguous collocation with a variable 
element (most often a function word such as the) in an intermediate position. 
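
The unigram...bigram pattern can be illustrated with a small counting sketch. The window size is not specified in the text, so the `window` parameter, the function name, and the example tags are all hypothetical; a window of 0 corresponds to the contiguous unigram+bigram pattern.

```python
from typing import List, Tuple


def count_unigram_then_bigram(
    tokens: List[str], unigram: str, bigram: Tuple[str, str], window: int
) -> int:
    """Count occurrences of `unigram` followed by `bigram`, allowing at
    most `window` intervening tokens between the two events."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok != unigram:
            continue
        for gap in range(window + 1):
            start = i + 1 + gap
            if tuple(tokens[start:start + 2]) == bigram:
                count += 1
                break  # count each unigram occurrence at most once
    return count


# Hypothetical lexeme-tag sequence with a variable element in the middle.
sentence = ["take/VERB", "the/DET", "high/ADJECTIVE", "road/NOUN"]
target = ("high/ADJECTIVE", "road/NOUN")
print(count_unigram_then_bigram(sentence, "take/VERB", target, window=1))
print(count_unigram_then_bigram(sentence, "take/VERB", target, window=0))
```

The resulting co-occurrence counts would then be dispersion-adjusted and placed in a contingency table in the same way as for 2-word collocations.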

4 Analysis 

The retrieval procedure was run for a wide variety of contiguous and noncon- 
tiguous collocations. The collocation patterns searched for were determined by 
examining the types of collocations already present in the bilingual lexicon, 
together with informal examination of the log-likelihood ratio ranked lists of 
discovered collocations. Table 3 shows some sample output. These are the top
ranked VERB2D (intransitive inchoative verb taking an adjective complement)
plus ADJECTIVE contiguous collocations.

make/VERB2D possible/ADJECTIVE
get/VERB2D ready/ADJECTIVE
get/VERB2D married/ADJECTIVE
look/VERB2D good/ADJECTIVE
make/VERB2D sure/ADJECTIVE
look/VERB2D great/ADJECTIVE
feel/VERB2D better/ADJECTIVE
feel/VERB2D good/ADJECTIVE

Table 3. Top ranked VERB2D-ADJECTIVE contiguous collocations

These collocations are all useful for our bilingual lexicon, as they generally
cannot be translated compositionally into Spanish. Table 4, presenting the top
ranking ADJECTIVE-NOUN-NOUN contiguous collocations, shows the effectiveness
of the dispersion technique: national captioning institute is a footer
inserted by the closed captioning service that occurs in nearly every text, and
therefore it receives top ranking. Another interesting observation is that 6 of the
top 20 are collocations clearly occurring in advertising.



national/ADJECTIVE       captioning/NOUN    institute/NOUN
paramount/ADJECTIVE      picture/NOUN       corporation/NOUN
new/ADJECTIVE            York/NOUN          city/NOUN
capital/ADJECTIVE        CITIES/ABC/NOUN    Inc/NOUN
twentieth/ADJECTIVE      century/NOUN       fox/NOUN
good/ADJECTIVE           morning/NOUN       America/NOUN
last/ADJECTIVE           time/NOUN          I/NOUN
first/ADJECTIVE          time/NOUN          I/NOUN
one/ADJECTIVE            thing/NOUN         I/NOUN
financial/ADJECTIVE      support/NOUN       FROM:/NOUN
American/ADJECTIVE       express/NOUN       TRAVELERS/NOUN
helpful/ADJECTIVE        hardware/NOUN      folk/NOUN
broadcasting/ADJECTIVE   company/NOUN       Inc/NOUN
spicy/ADJECTIVE          chicken/NOUN       sandwich/NOUN
only/ADJECTIVE           thing/NOUN         I/NOUN
daily/ADJECTIVE          UV/NOUN            PROTECTANT/NOUN
new/ADJECTIVE            York/NOUN          time/NOUN
honorable/ADJECTIVE      ELIJAH/NOUN        Muhammad/NOUN
alpine/ADJECTIVE         mint/NOUN          flavor/NOUN
whole/ADJECTIVE          grain/NOUN         wheat/NOUN

Table 4. Top ranked ADJECTIVE-NOUN-NOUN contiguous collocations



Our evaluation is based on the preliminary results presented in Table 5 for
the three collocation patterns whose evaluation has been completed to date by
the lexicographers (VERB1-ADJECTIVE, VERB2D-ADJECTIVE and NOUN-NOUN).
             a     b      c      d      e
1-ADJ      401    40    335    385    375
2D-ADJ     237    10    197    200    207
N-N       5096   484   1528   1528   2012

Legend:
a  number of collocations extracted from corpus
b  number of extracted collocations already present in bilingual lexicon
c  number of extracted collocations validated and coded by lexicographers
d  number of translations (possibly, more than 1 per collocation)
e  number of good collocations extracted (b + c)

Table 5. Preliminary results.
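
The rows of Table 5 can be cross-checked with a short sketch, reading the five figures in each row as a, b, c, d and e in that order. The share e / a computed here is our own derived illustration, not a figure reported in the table, and the function name is hypothetical.

```python
# Rows of Table 5, read as columns a, b, c, d, e in that order.
TABLE_5 = {
    "VERB1-ADJECTIVE": {"a": 401, "b": 40, "c": 335, "d": 385, "e": 375},
    "VERB2D-ADJECTIVE": {"a": 237, "b": 10, "c": 197, "d": 200, "e": 207},
    "NOUN-NOUN": {"a": 5096, "b": 484, "c": 1528, "d": 1528, "e": 2012},
}


def good_share(row: dict) -> float:
    """Fraction of extracted collocations judged good (e / a)."""
    return row["e"] / row["a"]


for pattern, row in TABLE_5.items():
    consistent = row["b"] + row["c"] == row["e"]  # legend: e = b + c
    print(f"{pattern:18s} e=b+c: {consistent}  e/a = {good_share(row):.2f}")
```

The check confirms that column e equals b + c in every row, as the legend states.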



Evaluation of the usefulness of the retrieved and ranked collocations is a 
difficult issue. Since it is not possible to calculate recall (this would require a 
manual count of the useful collocations present in the corpus), a precision score 
is not meaningful. Precision depends entirely on the thresholding of the adjusted 
co-occurrence frequencies: the higher the threshold, the better precision will be 
because the most useful collocations are generally found in the top part of the 
ranking. 

Instead, we based the evaluation of our procedure on the notion of utility rate. 
We examined how many of the retrieved collocations were subsequently included 
in the bilingual lexicon. Usefulness was determined by one of two lexicographers 
writing the bilingual lexicon; collocations that translate compositionally were 
not considered useful. The utility rate is the ratio of useful collocations to the 
cumulative collocations count in the ranked lists for each collocation pattern. 
If every collocation in the ranked list was included in the bilingual lexicon, the 
ratio would be 1. Graphically, with useful collocations plotted against cumulative 
collocations, the result would be a 45 degree line. We compared the utility rate 
for collocations ranked using our dispersion method to the ranking obtained by 
not adjusting event frequencies for dispersion, for each collocations pattern. In 
comparing the two methods, we make the assumption of completeness - all useful 
collocations were discovered in the “dispersion” list, and therefore everything else 
is not useful for a bilingual lexicon. This assumption is clearly suboptimal, but 
we have not yet had the “non-dispersion” lists assessed. 
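
The utility rate described above can be sketched as a cumulative ratio over a ranked list. The function name and the toy data are hypothetical; the sketch only illustrates how the curve is built, under the stated completeness assumption that anything outside the set of useful collocations counts as not useful.

```python
from typing import Iterable, List, Set, Tuple


def utility_curve(ranked: Iterable[str], useful: Set[str]) -> List[Tuple[int, int, float]]:
    """For each rank k, return (k, useful collocations so far, utility rate)."""
    curve = []
    hits = 0
    for rank, collocation in enumerate(ranked, start=1):
        if collocation in useful:
            hits += 1
        curve.append((rank, hits, hits / rank))
    return curve


# Toy ranked list; the set holds the entries accepted by the lexicographers.
ranked_list = ["make sure", "get ready", "new york", "feel better"]
accepted = {"make sure", "get ready", "feel better"}
for rank, hits, rate in utility_curve(ranked_list, accepted):
    print(rank, hits, round(rate, 2))
```

If every ranked collocation were accepted, the rate would stay at 1 throughout, which corresponds to the 45 degree line mentioned above.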

The results are shown in Figure 1. In each chart, the solid line represents the
dispersion method, the dark dotted line represents the non-dispersion method 
and the light dotted line represents the ideal curve. Note that, for comparison 
purposes, the lists obtained with the non-dispersion method were normalized to 
the length of their dispersion counterparts. A first result shown by the figures is 
that the dispersion method consistently extracts useful collocations that would 
not be top ranked without the adjustment for dispersion. 
[Figure 1 panels: VERB1-ADJECTIVE, VERB2D-ADJECTIVE, NOUN-NOUN]

Fig. 1. Number of useful collocations (x-axis) plotted against cumulative collocation
count (y-axis). Utility rate for the dispersion method is indicated by the
solid line, the non-dispersion method by the dark dotted line.



Although a direct comparison between the two methods (dispersion vs. non- 
dispersion) could not be done, due to the lack of data explained above, we per- 
formed an indirect comparison. Each dispersion list of collocations was compared 
with the intersection between the dispersion list itself and its non-dispersion 
counterpart. In other words, we compared each dispersion list with its own
non-dispersion portion. The results are shown in Figure 2 where, again, the solid
line represents the dispersion method and the dark dotted line represents the 
non-dispersion method. There is no visible difference in utility rate between the 
two methods. 
[Figure 2 panels: VERB1-ADJECTIVE, VERB2D-ADJECTIVE]

Fig. 2. Utility rate comparison between the dispersion list and the intersection
of the dispersion and non-dispersion lists.



What these results suggest is that the dispersion method replaced a number 
of useful collocations with an approximately equal number of useful collocations. 
However, due to the dispersion adjustment, the latter are more likely to have a 
wider semantic domain, thus providing larger overall coverage. 



5 Conclusions 

We have implemented a procedure for automatically retrieving collocations from 
a corpus of television closed captions. The discovery procedure adjusts collocation
frequencies to take into consideration their dispersion throughout the corpus.
This results in the ranking of collocations according to their usefulness,
with lower ranking assigned to collocations specific to a small number of texts. 
We argue that this method is of value for optimising the development of a wide 
semantic domain bilingual lexicon. 

We are planning to perform two further tests in order to confirm the useful- 
ness of the proposed approach. A first test is being designed in order to show that 
the differences between the dispersion and the non-dispersion methods are sta- 
tistically significant and are not due to chance. A second test will be conducted 
on a new corpus, in order to confirm that the dispersion method provides larger 
overall coverage than the non-dispersion method. 
Abstract. We show a diagnostic evaluation of DIPETT, a broad-coverage
parser of English sentences. We consider the TSNLP suite as a diagnostic tool, 
and propose an alternative broader-coverage test suite of test sentences 
extracted from Quirk et al. We compare the diagnostic effectiveness of the two 
suites, and draw a few general conclusions. The evaluation results were used to 
make significant improvements to DIPETT. 



1 Introduction 

Test suites have long been accepted in NLP because they provide controlled data that 
is systematically organized and documented. We investigate the use of test suites for 
evaluating the performance of broad-coverage parsers of English sentences, and use 
two test suites to evaluate the DIPETT¹ parser [8, 9] of the TANKA² project [4, 10,
11]. The TANKA project seeks to build a model of a technical domain by
semi-automatically processing written text that describes the domain. No other source of
domain-specific knowledge is available. The results of the evaluation were used to 
make significant improvements to DIPETT, a surface syntactic parser. 

We show that a test suite for a broad-coverage natural language parser must 
necessarily be systematic, broad in its coverage of phenomena tested, and corpus-like 
in its coverage of phenomenon interaction. A test suite of example sentences 
extracted from Quirk et al.’s comprehensive English grammar [15] is proposed, and
the results of evaluating DIPETT on that suite are compared with the evaluation
results on another publicly available test suite, TSNLP³. Both test suites and the
parser grammar were all based on the same widely acknowledged theory-neutral 
grammar. Although this work was performed on a single parser, the TSNLP suite is a 
generally known and useful resource, and so these findings are generally applicable 
and valuable. 



¹ Domain Independent Parser of English Technical Text
² Text Analysis for Knowledge Acquisition
³ Test Suites for Natural Language Processing

H. Hamilton and Q. Yang (Eds.): Canadian AI 2000, LNAI 1822, pp. 138-150, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 
1.1 Background

Parser Evaluation. Over the past few years, there has been increased interest in the 
evaluation of natural language processing tools. Carroll et al. [6] and Srinivas et al.
[17] provide surveys of current parser evaluation methods, and describe non-corpus 
and corpus-based evaluation methods that are suited to different types of evaluation. 
The HLT Survey - Survey of the State of the Art in Human Language Technology [7] 
presents a useful and descriptive introduction to NLP evaluation terminology, 
including diagnostic evaluation. 

Diagnostic evaluation tests system performance with respect to the space of 
possible inputs. In NLP application areas where coverage is important, particularly 
systems with explicit grammars, large test suites are commonly used for diagnostic 
evaluation. The purpose of a test suite is to enumerate linguistic phenomena in the 
input domain and their most likely or important combinations. Possible limitations of 
diagnostic evaluation are (1) “such test suites may not reflect the distribution of 
linguistic phenomena in actual application domains” [6], and (2) performance on a 
test suite does not provide information about coverage of English sentences in real 
text. Rare or marginal constructions might not be covered by a grammar or the test 
sentences. In addition, coverage of various constructions does not guarantee coverage 
of their interaction in combination with others. 

The TSNLP Project. The TSNLP project ran in 1993-1996. The purpose of the 
project was to investigate all aspects of the construction, maintenance, and application 
of systematic test suites as diagnostic and evaluation tools for NLP applications. The 
TSNLP project produced much insight into the nature of test suite design and 
delivered three large, publicly available, multi-purpose test suites in English, French,
and German [13]. 

The TSNLP project [3] defines a test suite for NLP as “a more or less systematic 
collection of specially constructed linguistic expressions (test items, phrases or 
sentences), perhaps with associated annotations and descriptions.” An important issue 
in the TSNLP design was to consider what exactly are the differences between a test 
suite and a corpus [1, 2]. In general terms, test suites are constructed of systematically 
chosen test items, while a corpus consists of selections of naturally occurring text. For
certain purposes, test suites provide a more direct tool [3]. Some of the differences 
between test suites and corpora discussed by Balkan [3] are as follows: 

• Control over test data 

• Systematic coverage of phenomena 

• Non-redundant representation of phenomena 

• Negative examples 

• Annotation of test items 

The TSNLP method is designed to optimize control over test data, progressivity, 
and systematicity [13] in order to construct a test suite that is adequately broad- 
coverage, multi-purpose, multi-user, multi-lingual, reusable, and has the advantages 
over corpora as described above. The TSNLP test suite covers a range of linguistic 
phenomena including complementation, modification, diathesis, modality, tense and 
aspect, clause type, coordination, and negation. The TSNLP data also include a
detailed dependency-based annotation schema that allows mapping onto constituent 
structures [12]. 

Previous Test Studies of TANKA. In 1996 and 1998 Barker et al. published [5, 10]
then-current experimental results of TANKA’s performance in extracting knowledge
from two selected texts. In the first experiment, a simple children’s science book 
describing a technical domain, the weather, was parsed by DIPETT. A syntactically 
more complex text was chosen for the second experiment, a book on the mechanics of 
small engines. These experiments did not, and were not intended to, constitute a full- 
fledged diagnostic evaluation of DIPETT. 

1.2 Proposal for a Broader-Coverage Test Suite 

Considerations of Test Suites versus Test Corpora. We considered the benefits and 
drawbacks of using a corpus, rather than a test suite, for diagnostic evaluation. Use of 
a corpus for testing of coverage was rejected for the following reasons: 

• A corpus does not guarantee broad-coverage. 

• A corpus lacks focus on particular phenomena. 

• A corpus does not provide systematic testing of variations over a specific 
phenomenon. 

• A corpus lacks systematic testing of the co-occurrence of different constructions. 

• Typically, corpora are large. 

• A corpus may provide redundant coverage of phenomena, in that the phenomena 
that do occur may occur repeatedly. 

Using the TSNLP Suite. The TSNLP group has provided an excellent test suite for 
diagnostic evaluation. Certainly, a broad-coverage parser should be able to parse all 
of the TSNLP test sentences. However, the breadth of the TSNLP test suite has
limitations that arose from the design requirements: 

• The lexicon is limited. 

• The linguistic phenomena covered by the test sentences were chosen based on 
linguistic relevance and frequency across three languages, not just English. 

• Interaction of phenomena is highly controlled. 

In short, the TSNLP suite may be useful for highlighting deficiencies in a 
grammar, but it lacks the benefits of parsing corpus data, and the coverage of the 
downloaded database test items cannot be considered truly broad unless it
were extended, as envisaged by its design. Still, a diagnostic evaluation would be
incomplete if the TSNLP data were not considered. 



The TSNLP database at http://tsnlp.dfki.uni-sb.de/tsnlp/tsdb/tsdb.cgi?language=english
Proposal for a Quirk-Based Test Suite. To augment the coverage provided by the
TSNLP suite, we proposed to design a collection of test sentences that was annotated 
and organized according to a formal English grammar description. Quirk et al.’s
grammar [14, 15] was chosen as the source. We extracted the example sentences and
the ill-formed sentences from Quirk et al., and organized them by topic. These were
then used to test DIPETT’s performance on the distribution of syntactic structures. 

Advantages. The advantage of this method was that it drew upon the strengths of both 
test-suite methods and corpus-like methods: 

• The test sentences are not derived from DIPETT’s grammar. 

• They provide broad coverage of linguistic phenomena and are indexed according 
to linguistic phenomena. 

• They are rich in syntactic structure and test the distribution of syntactic structures. 

• The grammar is buttressed by an authority. 

• The test sentences use an unconstrained vocabulary. 

• The coverage of phenomena interaction is corpus-like in that it is not controlled as 
in a test suite. 

Disadvantages. The disadvantages of using the Quirk sentences as a diagnostic tool 
are as follows: 

• The lexicon is not limited and is therefore harder to manage. 

• Sentences contain phrases that are irrelevant to the phenomenon described. 

• Redundant structures and variable depth: some sections have many examples, 
some have very few. 

• The complete collection is large. 



2 A Diagnostic Evaluation of DIPETT 

The Quirk test sentences were extracted from Quirk et al. [15] chapters 2 and 3. The
chapter 2 sentences cover a general outline of English grammar and of its major
concepts and categories. The chapter 3 sentences cover the grammar of verb phrases.
These sentences were augmented with selections from chapter 5 of Quirk et al. [15]
that illustrate the basic constituents of a noun phrase. Both grammatical and 
ungrammatical example sentences were included. 

The complete tsdb(1) data were downloaded and stored as a Microsoft Access
database. Queries were developed to extract test items and their related phenomena 
and analysis. The database is large, 4612 sentences and sentence fragments in all. The 
data used was restricted to the set of test items that are complete sentences and clearly 
either grammatical or ungrammatical (well-formedness code 0 or 1). The extraction 
produced 1173 grammatical and 1535 ungrammatical sentences for the TSNLP test 
items. 
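The selection step described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names `wellformedness` and `is_sentence`, the reading of code 1 as grammatical and 0 as ungrammatical, and the sample items are all assumptions, not the actual tsdb(1) schema or the Access queries that were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteItem:
    text: str
    wellformedness: int  # assumed coding: 1 = grammatical, 0 = ungrammatical
    is_sentence: bool    # assumed flag: False for sentence fragments


def select_items(items):
    """Keep complete sentences judged clearly grammatical or ungrammatical."""
    usable = [it for it in items if it.is_sentence and it.wellformedness in (0, 1)]
    grammatical = [it for it in usable if it.wellformedness == 1]
    ungrammatical = [it for it in usable if it.wellformedness == 0]
    return grammatical, ungrammatical


# Invented sample data, for illustration only.
sample = [
    SuiteItem("The engineer succeeds.", 1, True),
    SuiteItem("The engineer succeed.", 0, True),
    SuiteItem("the old engineer", 1, False),         # fragment: dropped
    SuiteItem("Engineers succeeds that.", 2, True),  # unclear judgement: dropped
]
grammatical, ungrammatical = select_items(sample)
print(len(grammatical), len(ungrammatical))
```

Splitting the items by judgement at this stage keeps the two batches separate for the later accept/reject analysis.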

All of the sentences were prepared for batch parsing by DIPETT, and a lexicon of 
possible-part-of-speech entries for unknown words was produced for each batch file. 

2.1 Summary of Parse Results 



Grammatical Sentences. The parse results for the grammatical sentences are 
summarized in Table 1. Coverage is the percentage of examples from a given corpus 
for which a parser/grammar is able to assign at least one complete (not fragmentary) 
analysis. DIPETT produced full parse trees for 79% of the grammatical Quirk 
sentences but only 65% of the grammatical TSNLP sentences. This result was 
unexpected because the TSNLP sentences are shorter, simpler, and narrower in their 
coverage of phenomena and phenomena interaction than are the Quirk sentences. 
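As a worked check of the coverage definition, the following Python sketch recomputes the two headline percentages from the counts reported in Table 1. The function name `coverage` and the dictionary layout are illustrative choices, not part of DIPETT or TSNLP.

```python
def coverage(full_parses, sentences):
    """Percentage of sentences that received at least one complete parse."""
    return 100.0 * full_parses / sentences


# (full parses, grammatical sentences) as reported in Table 1 for DIPETT v3.0.
table1 = {
    "Quirk ch. 2": (218, 252),
    "Quirk ch. 3": (221, 308),
    "Quirk ch. 5": (18, 18),
    "TSNLP": (745, 1143),
}

quirk_full = sum(full for name, (full, _) in table1.items() if name.startswith("Quirk"))
quirk_total = sum(total for name, (_, total) in table1.items() if name.startswith("Quirk"))

quirk_coverage = coverage(quirk_full, quirk_total)   # 457 of 578 sentences
tsnlp_coverage = coverage(*table1["TSNLP"])          # 745 of 1143 sentences
print(round(quirk_coverage), round(tsnlp_coverage))
```

Pooling the three Quirk chapters before dividing, rather than averaging the per-chapter percentages, is what yields the 79% figure quoted in the text.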

Ungrammatical Sentences. The parse results for ungrammatical sentences are 
summarized in Table 2. Ungrammatical sentences require special treatment by a 
parser. DIPETT generated a fragmentary parse for 38% of the Quirk sentences and 
53% of the TSNLP sentences. It was found that many of the ungrammatical sentences 
were not useful for diagnostic evaluation of DIPETT; ungrammatical sentences are 
useful only for testing phenomena that are specifically rejected by the grammar. For 
DIPETT, ungrammatical sentences are useful for testing such phenomena as number 
agreement but not phenomena such as complementation. Sometimes the full parse 
trees generated for ungrammatical sentences are perfectly reasonable, given the 
information DIPETT has available. This is particularly true for the Quirk sentences, 
where some of the ungrammaticality is purely semantic. 



Table 1. DIPETT v3.0: Parse Results for Grammatical Sentences 

Grammatical Sentences                        Quirk et al. chapter                   TSNLP
                                             2            3            5
Number of grammatical sentences              252          308          18           1143
Number of full parses generated by DIPETT    218 (86%)    221 (72%)    18 (100%)    745 (65%)
Number of correct full parses                120 (48%)    117 (38%)    18 (100%)    467 (41%)



Table 2. DIPETT v3.0: Parse Results for Ungrammatical Sentences 

Ungrammatical Sentences                      Quirk et al. chapter                   TSNLP
                                             2            3            5
Number of ungrammatical sentences            22           44           10           1519
Number of fragmentary parses                 7 (32%)      17 (39%)     5 (50%)      800 (53%)
Full parses of ungrammatical sentences       15 (68%)     27 (61%)     5 (50%)      719 (47%)

Table 3. DIPETT v3.0: Summary of Results on TSNLP Phenomena 

TSNLP Phenomenon                                          Number of     Fragmentary   Correct       Incorrect
                                                          Sentences     Parses        Full Parses   Full Parses
C_Agreement-Collectives                                   2             0             2             0
C_Agreement-Coordinated_subj(correlatives)                12            6             6             0
C_Agreement-NP_V                                          2             0             2             0
C_Agreement-Parentheticals                                6             3             3             0
C_Agreement-Partitives                                    17            11            5             1
C_Agreement-Pron_V                                        14            0             14            0
C_Agreement-Relative_clauses                              7             1             7             0
C_Complementation-Divalent                                43            1             41            1
C_Complementation-Equi                                    9             2             0             7
C_Complementation-Monovalent                              2             0             2             0
C_Complementation-Raising                                 17            4             0             13
C_Complementation-Tetravalent                             2             1             0             1
C_Complementation-Trivalent                               75            2             44            29
C_Tense-Aspect-Modality                                   157           45            73            39
C_Negation-Tense_aspect_modality                          287           126           90            72
C_Parentheticals-Embedded                                 1             1             0             0
NP_Coordination                                           135           0             135           0
NP_Modification-Relative_clauses                          144           132           0             12
NP_Parentheticals                                         8             8             0             0
S_Parentheticals-Punct_interaction                        3             3             0             0
S_Types-Questions-Wh                                      37            13            23            1
S_Types-Questions-Y/N                                     23            4             19            0
S_Types-Questions-Y/N_questions-Non_inverted-Non-tagged   14            14            0             0
Total                                                     1017          377           466           176


2.2 Evaluation of DIPETT’s Performance on the TSNLP Sentences 



Table 3 shows a complete summary of DIPETT’s parse results on TSNLP 
grammatical sentences grouped by TSNLP phenomena. A number of problems were 
identified. 



Clause Agreement Phenomena. DIPETT’s parse trees for the TSNLP clause 
agreement test items contained a number of errors. DIPETT does not recognize 
collective nouns as being plural. DIPETT’s recognition of correlative conjunctions is 
insufficient; for example, “both ... and” is not recognized. It should be noted that the 
coverage of correlative conjunctions in the TSNLP test suite is not complete either. 
DIPETT does not recognize many partitives or phrasal quantifiers. 

Monovalent and Divalent Complementation. In contrast to Quirk et al., the 
TSNLP group defines complementation to include the subject as well as objects, 
prepositions, and obligatory adverbs. DIPETT’s internal dictionary classifies 
monovalent (intransitive) and divalent (transitive) verbs, and DIPETT parsed these 
sentences very well. The only divalent phenomena that DIPETT failed to parse 
properly were the subjunctive and a sentence containing an intransitive phrasal verb. 

The TSNLP suite has some test items for adverbial particles. DIPETT has 
provisions for phrasal verbs and the built-in dictionary contains many adverbial 
particle entries, but the feature is essentially “turned off” and not used. Preference is 
given to a nil verbal particle; particles are almost always parsed as simple adverbs, 
not as particles. 

Trivalent and Tetravalent Complementation. DIPETT did not perform well on the 
trivalent complementation tests. DIPETT does not properly recognize object 
complements, obligatory adverbial complements, obligatory ditransitives, complex 
transitives, or transitive phrasal verbs. Sometimes DIPETT recognizes 
wh-complements, but not always; wh-complements may be misparsed as adverbials. 
DIPETT cannot distinguish between complementation patterns obj+pp and iobj+pp, 
but that would be difficult for a semantically weak parser to achieve. Tetravalent 
complementation is not a common complementation pattern; there are only two 
example sentences in the TSNLP data. DIPETT was unable to generate a full parse 
tree for one of those sentences, and the other was not parsed as tetravalent. 

Equi and Raising. “Raising” of the object is not shown in DIPETT’s parse trees. 
Neither are the “equi” clausal complementation phenomena Equi-Obj-control, 
Equi-Pobj-control, and Equi-Subj-control shown in DIPETT parse trees. The tests for these 
phenomena, described by Soames and Perlmutter [16], require that the referential 
subject of an infinitive or to-infinitive complement be identified. 

Active, Passive, and Middle Diathesis. The test items for active diathesis 
phenomena are a subset of the complementation test items described above. DIPETT 
recognizes passive diathesis only when the passive agent is specified. If the passive 
subject is nil, then the sentence is interpreted as stative. DIPETT is not aware of verbs 
that cannot be passivized. DIPETT does not recognize the passive auxiliary get. 

Coordination Phenomena. DIPETT was able to parse all of the noun phrase 
coordination examples. The grammatical test items for case assignment in the subject 
position contain a number of examples having conjoins in the accusative case, for 
example, “Me and him succeed.” DIPETT accepted all of these constructs and 
generated full parse trees. 

Parenthetical Phenomena. DIPETT is extremely sensitive to punctuation and failed 
to parse all of the parenthetical test items. The TSNLP noun phrase parenthetical test 
items are marked by square brackets, single dashes, and round brackets with a single 
comma. DIPETT can parse these tokens with some minor editing: round brackets 
instead of square brackets, double dashes instead of single dashes, and removal of the 
single commas. Some of the sentence parenthetical test items contain a sentence terminator 
such as a period or an exclamation mark within the parentheses. DIPETT’s scanner 
recognizes those as the final sentence terminator and terminates the sentence 
inappropriately; the sentence thus fails to parse. 
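The terminator problem can be illustrated with a small Python sketch. It contrasts a naive scan, which stops at the first terminator, with a scan that ignores terminators nested inside brackets. Both functions and the example sentence are hypothetical; they are not DIPETT’s scanner (which is written in Prolog) and not a TSNLP test item.

```python
CLOSERS = {"(": ")", "[": "]"}
TERMINATORS = ".!?"


def naive_end(text):
    """Index of the first terminator, wherever it occurs (the failing behaviour)."""
    for i, ch in enumerate(text):
        if ch in TERMINATORS:
            return i
    return -1


def sentence_end(text):
    """Index of the first terminator that is not nested inside brackets."""
    expected = []  # stack of closing brackets still awaited
    for i, ch in enumerate(text):
        if ch in CLOSERS:
            expected.append(CLOSERS[ch])
        elif expected and ch == expected[-1]:
            expected.pop()
        elif ch in TERMINATORS and not expected:
            return i
    return -1


# Invented example with a terminator inside the parenthetical.
example = "The manager (he is old!) succeeds."
print(naive_end(example), sentence_end(example))
```

The naive scan cuts the example off at the exclamation mark, whereas the bracket-aware scan reaches the final period.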

Relative Clause Phenomena. DIPETT failed to parse most of the TSNLP test items 
for relative clauses. A few sentences that did parse were of the form “The committee, 
the firm of which succeeds, comes.” and were parsed incorrectly as having an appositive 
rather than a relative clause. All of the sentences that failed to parse have commas delineating 
the relative clauses, and the relative clauses are marked either by a relative pronoun 
or by a preposition followed by a relative pronoun. 

Questions. DIPETT’s grammar for questions is deliberately simple. DIPETT was 
developed for extracting knowledge from text, and it was not expected that much 
knowledge would be extracted from an interrogative statement. Still, DIPETT parsed 
two-thirds of the TSNLP question phenomena test items. DIPETT’s performance on 
the TSNLP question phenomena was about the same as its performance on the whole 
TSNLP data. 

Tense and Modality Phenomena. TSNLP provides very systematic testing of tense 
and modality phenomena. DIPETT should be able to parse all of these sentences 
correctly, but a number of problems were identified in DIPETT’s verb sequence 
grammar: 

• DIPETT is unable to parse verb phrases containing forms of “be being”. 

• DIPETT analyzes passive verb sequences as being stative. 

• The marginal modal “used to” is sometimes not recognized. 

• Some tense-aspect-modal combinations are not recognized. 

• Some negative contractions are not recognized: daren’t, usen’t, usedn’t, hasn’t. 

2.3 Evaluation of DIPETT’s Performance on the Quirk Sentences 

Summary of Problems Identified. A number of problems were found in DIPETT’s 
parses of the Quirk sentences; they are listed below. Several of the problems identified 
by the TSNLP sentences were also identified by the Quirk sentences. The 
following problems in DIPETT’s grammar were identified by the Quirk sentences but 
not by the TSNLP test items: 

• Adjectives overgenerate from adverbs. 

• Adverbs preceding a prepositional phrase are attached to that prepositional phrase. 

• Time-related and place-related noun phrases are not recognized as adverbs. 

• An adverb may be misinterpreted as an intensifier. 

• Intensifiers of adjectives and adverbs are conjoined with the head as though they 
were both conjoined by “and”. 

• Predeterminer position is restricted to the beginning of a noun phrase. 

• Some date forms are not recognized. 

• A noun postmodifier can be mistaken for a verb complement and vice versa. 

• In compound sentences where one of the clauses is stative, the two separate 
clauses are not recognized. 

• DIPETT does not recognize some modal idioms, semi-auxiliaries, and catenatives. 

The following are covered by the Quirk sentence test suite but are not covered by 
DIPETT’s grammar: pro-forms, fronting, the subjunctive in any form, prepositional 
verbs, the passive gradient and verbs that cannot be passivized, operator in reduced 
clauses, directives as distinguished from exclamatives. 



3 Improving DIPETT’s Coverage 

Following completion of the evaluation of DIPETT, modifications were made to the 
grammar of DIPETT v3.0 to produce a new DIPETT v4.0. Here we describe the 
improvements to DIPETT’s grammar and a re-evaluation to measure the 
improvement in DIPETT’s performance on the Quirk and TSNLP sentences. 

Optimizing on TSNLP. Despite the fact that the Quirk sentences provided a good 
diagnostic evaluation of DIPETT’s grammar, we decided that the first goal for 
improving DIPETT should be to improve its performance on the TSNLP suite: 

• As Table 1 shows, DIPETT performed worse on the TSNLP sentences than it did 
on the Quirk sentences. DIPETT generated a parse tree for 79% of the grammatical 
Quirk sentences but only 65% of the grammatical sentences of the TSNLP suite. 

• The TSNLP sentences are simple. Each TSNLP sentence covers a single isolated 
phenomenon; the Quirk sentences were deliberately chosen because they were 
more uncontrolled in their phenomena interaction. 

• The TSNLP sentences are repetitive. Fixing one sentence often fixes several, and 
can cause a dramatic performance increase. 

• Performance on TSNLP is a useful benchmark. Srinivas et al. [17] published the 
performance of XTAG on TSNLP; they report that XTAG parses 61.4% of the 
grammatical sentences. 

Modifications to DIPETT’s Grammar. DIPETT’s grammar is based on Quirk et al. 
[15] and Winograd [18]. Natural language parsers are notoriously tricky, especially 
those written in Prolog. Parsing is also very sensitive to deviations in the lexicon. 
DIPETT is implemented in Quintus Prolog; the actual logic grammar of the parser is 
600 DCG rules in 3200 lines of Prolog, and the built-in dictionary of closed-category 
words contains over 3000 entries. Modifications to DIPETT were made as 
unobtrusively as possible to minimize the risk of introducing errors and side effects. 
The modifications included the following: 

• Extended correlative conjunctions to include “both.. and”. 

• Corrected parsing of passive sentences and allowed for nil passive subject. 

• Added the progressive form of the auxiliary verb be. 

• Completed the tense table (modal, aspect, and tense combinations). 

• Added missing negative contractions. 

• Fixed the parsing of “used to” in the passive, stative, and in aspect and tense 
combinations. 

• Extended relative clauses to allow comma delimiters and relative clauses marked 
by a preposition followed by a wh-word. 

• Added support for the mandative subjunctive. 

• Allowed declarative statements terminated by a question mark (for the TSNLP 
S_Types-Questions-Y/N_questions-Non_inverted-Non-tagged test items). 



3.1 Comparative Evaluation of the Two Versions of DIPETT 



Table 4 and Table 5 summarize the results of parsing both test suites with both 
versions of DIPETT. Table 6 shows a summary of the changes in DIPETT’s 
performance on the TSNLP phenomena. The data included in Table 6 are only for the 
sentences where a full parse tree changed, or a full parse tree was generated for a 
sentence where there had been a fragmentary parse. 

For the Quirk suite, thirty-eight new parse trees were generated, and thirty-three 
parse trees were changed. Almost all of these changes are improvements — most of 
the newly generated parse trees are acceptable, and the parse trees that are different 
have changed for the better. 



Table 4. Comparison of Total Number of Sentences Fully Parsed for Grammatical Sentences 

Grammatical Sentences                              Quirk et al. chapter                   TSNLP
                                                   2            3            5
Number of grammatical sentences                    252          308          18           1143
Number of full parses generated by DIPETT v3.0     218 (86%)    221 (72%)    18 (100%)    745 (65%)
Number of full parses generated by DIPETT v4.0     226 (90%)    248 (81%)    18 (100%)    1045 (92%)



Table 5. Comparison of Total Number of Ungrammatical Sentences Rejected 

Ungrammatical Sentences                                    Quirk et al. chapter                   TSNLP
                                                           2            3            5
Number of ungrammatical sentences                          22           44           10           1519
Number of fragmentary parses generated by DIPETT v3.0      7 (32%)      17 (39%)     5 (50%)      800 (53%)
Number of fragmentary parses generated by DIPETT v4.0      6 (27%)      15 (34%)     5 (50%)      719 (47%)

Table 6. Summary of Change in Performance on the TSNLP Phenomena 

                                               Number of   DIPETT v4.0                               Improvement in
                                               Parses      Fragmentary   Correct       Incorrect     Full      Correct
Phenomenon                                     Changed     Parses        Full Parses   Full Parses   Parses    Parses
C_Agreement-Coordinated_subj(correlatives)     6           0             6             0             6         6
C_Complementation-Divalent                     1           0             1             0             1         1
C_Complementation-Trivalent                    1           0             1             0             1         1
C_Complementation-Tetravalent                  1           0             1             0             1         1
C_Diathesis-Active                             3           0             3             0             3         3
C_Diathesis-Passive                            22          0             22            0             12        21
C_Negation-Tense_aspect_modality               199         0             199           0             125       197
C_Tense-Aspect-Modality                        86          0             86            0             45        85
NP_Modification-Relative_clauses               112         0             106           6             99        105
S_Types-Questions-Wh                           5           0             0             5             5         0
S_Types-Questions-Y/N                          5           0             4             1             1         0
S_Types-Questions-Y/N_questions-Non_inverted   13          0             13            0             13        13
Total                                          454         0             442           12            312       433



4 Conclusions and Future Work 

The systematic procedure described in this paper helped to improve a rather complex 
parser of English sentences. Some lessons may be drawn by those who want to use 
the same method. 

4.1 Comparison of Effectiveness of the TSNLP and Quirk Test Items 

DIPETT’s parse trees for the TSNLP test items were compared with the analysis 
provided in the TSNLP tsdb(1) database. The comparisons are necessarily manual, 
and therefore not easy. The TSNLP suite is very fine grained and pinpoints problems 
very precisely. Its design requires phenomena to be tested in isolation, not in 
combination. Phenomena coverage of the tsdb(1) suite is not properly broad-coverage 
unless it has been extended. The test items for the basic sentence phenomena covered 
(complementation, modification, diathesis, modality, tense and aspect, clause type, 
coordination, and negation) are exhaustively complete. 

The Quirk-based test suite complements the TSNLP suite, and is useful in ways 
other than the TSNLP. The parse analyses of the Quirk sentences are not as readily 
available as they are for the TSNLP test items, and the evaluation of the resulting 
DIPETT parse trees required careful study. The Quirk test sentences are coarser 
grained than the TSNLP suite. They are broader in their phenomena coverage, but not 
so completely or exhaustively precise in their diagnostic coverage of each particular 
phenomenon covered. Even though only two chapters of Quirk et al. were used in 
the evaluation, the Quirk sentences broadly diagnosed most of the same problems that 
were identified by the TSNLP evaluation. The Quirk sentences also identified 
linguistic phenomena that are not covered by DIPETT and were not covered or 
diagnosed by the TSNLP sentences either. In particular, less controlled combination 
of phenomena in the test sentences helped to identify significant problems in 
DIPETT’s grammar. 

The TSNLP method is designed to optimize control over test data, progressivity, 
and systematicity. Their test sentences are deliberately simple and the interaction of 
phenomena is highly controlled. Our experience shows that a test suite for a truly 
broad-coverage natural language parser must be corpus-like in its coverage of 
phenomenon interaction, in that corpora comprise arbitrary (but relevant) 
combinations of phenomena. 

4.2 Improvements to DIPETT’s Performance 

The modifications to DIPETT’s grammar were intended to improve DIPETT’s 
performance on the TSNLP suite, and a marked improvement in the number of full 
parses generated by DIPETT has been achieved — an increase from 65% to 92% full 
parses. The same improvements also caused a more modest improvement in the 
coverage of the Quirk sentences, even though no attempt was made to “tune” 
DIPETT to that test suite. The number of full parses on all of the 578 Quirk sentences 
rose from 79% to 85%. 
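The Quirk figures quoted above can be reproduced from the per-chapter counts in Table 4. The following Python sketch is only a worked check of that arithmetic; the variable names are illustrative.

```python
# Per-chapter counts for Quirk chapters 2, 3 and 5, as reported in Table 4.
sentences = [252, 308, 18]
full_v3 = [218, 221, 18]
full_v4 = [226, 248, 18]

total = sum(sentences)  # all Quirk sentences used in the evaluation

pct_v3 = 100.0 * sum(full_v3) / total
pct_v4 = 100.0 * sum(full_v4) / total

# Newly fully parsed sentences per chapter after the grammar changes.
gains = [new - old for old, new in zip(full_v3, full_v4)]

print(total, round(pct_v3), round(pct_v4), gains)
```

Most of the gain comes from chapter 3, the verb phrase chapter, which is consistent with the tense, aspect, and passive fixes made to the grammar.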

We worked with the assumption that a parser produces full parse trees only for 
correct data. Little improvement was shown in parsing (rejecting) ungrammatical 
sentences, but no attempt was made to optimize DIPETT for that purpose. Some 
grammar additions actually permit DIPETT to parse (accept) ungrammatical 
sentences that were rejected before. This condition is not entirely due to shortcomings 
in DIPETT’s grammar. Before meaningful results for ungrammatical sentences can be 
achieved, such sentences in the test suites must be pruned, leaving those found 
ungrammatical by a surface syntactic parser. For such parsers, the ungrammatical test 
items for the TSNLP complementation phenomena need not be used; neither should 
the Quirk sentences that are semantically but not syntactically incorrect. 

4.3 Future Work 

Extending the TSNLP test suite with test items extracted from Quirk et al. has 
provided a broader tool for diagnostic evaluation of parsers of the English language. 
The future direction for this work is a systematic comparison of the effectiveness of 
the Quirk suite, and selections from the TSNLP test data, on a few publicly available 
Web-based parsers. 

Abstract. To study the benefit of using an explanatory parallel mech- 
anism for cognitive processing, we propose a model which allows human 
performance and introspective theories to be tested in a unified way. The 
approach is verified by showing how it can be used to study natural lan- 
guage learning and understanding at both the utterance and discourse 
level within the context of coordinated visual stimuli. This study demon- 
strates that the validity of the underlying theories can be demonstrated 
by their ability to work together to form a unified explanatory model of 
the performance of a young language user. 



1 Introduction 

Traditional symbolic modeling of cognitive processing has mostly focused on 
the sequential processing of a few tasks such as language or vision processing, 
planning, scheduling, and learning. While a number of symbolist researchers 
have attempted to parallelize the generalized search and inference used in these 
processes, most attempts to explore the explanatory nature of massive 
parallelism have been made in the connectionist domain. However, the connectionist 
approach’s ability to study unified cognition suffers from the daunting complex- 
ity of building and understanding a large scale simulation of a natural neural 
network. To attempt to gain from the strengths of both approaches, we propose a 
process by which the explanatory nature of massive parallelism can be explored 
using both a symbolic and connectionist framework. 

To demonstrate the utility of such a framework, we show that it has the 
ability to generate an explanatory model of a complex cognitive task. Since 
complex cognitive tasks are inherently difficult to model, we have chosen the well 
understood problem of combined language-vision processing, but have reduced 
the performance complexity of the system to that of a young child just learning 
to form two and three word sentences from her acquired lexical entries. This 
phase of language development is often referred to as telegraphic speech [B] . 

The resulting system of agents can be shown to: 1) learn new language el- 
ements and visual associations from adult language input, 2) process language 
at both the utterance and discourse level, and 3) demonstrate a basic level of 
intentionality in the way each agent reacts to its surroundings. To support these 
capacities, each agent contains rudimentary elements of human language, vi- 
sion, and higher order processing which function together to perform cognitive 
tasks. While the agents contain, and are capable of learning, both semantic and 
procedural knowledge of language and the world, they are currently little more 
than little “Chinese rooms” when it comes to their actual intelligence. However, 
current evaluations of the system’s performance seem to demonstrate that this 
lack of intelligence may be, as some suggest, more a function of quantity than 
quality [^. 

2 Experimental Methodology 

Our research methodology can best be explained by briefly contrasting it with 
two of the most established symbolic modeling approaches, SOAR and ACT-R 
121, [2]. While our research shares the unified cognition scope and the mind/brain 
view of cognition of these more mature efforts, we focus more on the parallel pro- 
cessing of dissimilar cognitive tasks within the embedded mind, and thus, less 
on monolithic control and uniform representation [^. For example, ACT-R is 
viewed as both a formal model of cognition and a test platform for individual 
formal models of physiological processes. In contrast, we rely directly on the 
non-computational models of traditional philosophy, psychology and linguistics 
research to build non-formal computational models which span across a number 
of parallel processes, each having its own non-homogeneous symbolic represen- 
tation and inference method. 

Our use of distributed cognitive processing has a profound effect on the level 
to which consciousness, motivation and attention can be approached by our re- 
search. The serial nature of SOAR and ACT-R provides little explanatory depth 
for such studies, and other agent-based approaches, like Belief, Desire and Intention 
(BDI) systems, require too much formalism for the current understanding 
of these phenomena. We believe that our ability to directly address these is- 
sues without relying on some overly restrictive formalism is a major benefit to 
our approach. For example, the unification of a collection of independent asyn- 
chronous cognitive processes within our system is accomplished by a common 
set of stimuli messages which serve as both the input and output of each dis- 
tributed process. This has the effect of distributing control of the system among 
those cognitive processes that have the broadest view of the particular task at 
hand. Task selection (or attention) can then be viewed as a resource allocation 
problem resolved through conflict management. 

At the heart of this research approach is a computational architecture, the 
Adaptive Modeling environment for Explanatory Based Agents (AMEBA), which 
allows computational models to be combined into a flexible parallel application. 
Using AMEBA, we can avoid locking ourselves into a single formal theory for 
the representation and inference method used by all cognitive processing within 
the brain. Driven by a set of non-formal human performance and introspection 
based models, the explanatory nature of the computational model being used 
for one cognitive task within the system can be tested by its ability to perform 
in the predicted manner when it is not directly in control of the overall cognitive 
activity of the system under test. 

By building computational models that directly capture human performance 
and introspection, we have been able to avoid the overformalization of poorly 
understood phenomena. Instead of formal proofs of the representation and 
reasoning methods being used, the checks and balances in our approach result from 
the dynamic interplay of the individual components. The system forces experi- 
mental theories that were often developed in isolation to actually work together 
toward a common goal, thus allowing theories to be tested on how well they 
interact. While testing the interplay of non-formal theories is not necessarily a 
new concept, the AMEBA architecture allows us to do this interactive testing 
in a far more flexible manner. 

As would be expected, the merging of non-formal theories often generates 
quite unpredictable results, but we are seeing some very promising validations 
of the non-formal theories being used. The long-term goal is for the system to be 
able to generate new formal theories that can be tested by experimental studies 
with human participants. 

3 The Supporting Architecture 

The AMEBA architecture represents the refinement of generalized parallel tools 
to produce a special purpose parallel environment for testing cognitive theories. 
The architecture runs on a SMP cluster using a Beowulf-like connection scheme 
of multiple high-speed networks m- The architecture supports distributed pro- 
cess control and centralized agent design and management. While highly tailored 
to support system-wide knowledge management and the communication struc- 
ture needed for a cognitive modeling environment, its control structure is similar 
to more generic environments like the Parallel Virtual Machine environment [10]. 

At a theoretical level, AMEBA attempts to capture the explanatory force 
of a connectionist neural model while allowing the use of the better understood 
representation and reasoning methods of symbolic AI. At a computational level, 
it provides processor transparency within a parallel system and a flexible method 
of process and knowledge management. The key element in the solution of both 
sets of requirements is the etheron process template shown in Figure 1. An 
etheron provides a container for an instance of any inference or routing mecha- 
nism needed by the system. Once contained, the etheron supports the mechanism 
with: 1) a standard way to load and store knowledge, 2) interfaces to AMEBA’s 
management tools and 3) a generalized set of communication channels for talking 
with other etherons. 
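The three services an etheron offers its contained mechanism can be pictured with a minimal Python sketch. Everything here is hypothetical: the class, its method names, and the toy `lookup` mechanism are illustrative stand-ins, not the AMEBA process template or its API, and a single in-process queue stands in for the communication channels.

```python
from collections import deque


class Etheron:
    """Hypothetical container around one inference or routing mechanism."""

    def __init__(self, name, mechanism):
        self.name = name
        self.mechanism = mechanism  # callable: (knowledge, message) -> reply
        self.knowledge = {}
        self.inbox = deque()        # stand-in for a communication channel
        self.running = False

    # 1) A standard way to load and store knowledge.
    def load_knowledge(self, facts):
        self.knowledge.update(facts)

    def store_knowledge(self):
        return dict(self.knowledge)

    # 2) Management hooks: start and stop independently of other etherons.
    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    # 3) Communication: queue a message, then process one per step.
    def receive(self, message):
        self.inbox.append(message)

    def step(self):
        """Process one queued message; do nothing while stopped or idle."""
        if not self.running or not self.inbox:
            return None
        return self.mechanism(self.knowledge, self.inbox.popleft())


def lookup(knowledge, word):
    """Toy mechanism: look a word up in the loaded knowledge."""
    return knowledge.get(word)


lexicon = Etheron("lexicon", lookup)
lexicon.load_knowledge({"ball": "noun", "throw": "verb"})
lexicon.receive("ball")
stopped_reply = lexicon.step()  # not started yet, so the message stays queued
lexicon.start()
running_reply = lexicon.step()
print(stopped_reply, running_reply)
```

Keeping the mechanism as a plain callable means the same container could wrap a production system, a semantic network, or a lookup tool without changing the management or messaging code.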

Using AMEBA’s management software, a user can dynamically build a sys- 
tem of agents out of a set of predefined etheron types and control the internal 
knowledge loaded into each etheron during operation. Since etherons support 
the ability to be started, stopped and moved independently, the user can 
dynamically select which portion of a parallel system to activate. This feature has 
proven very useful in troubleshooting a system where the process execution of 
each node is purely asynchronous of other process nodes. 

During the building of an etheron type, the builder can use predefined in- 
ference methods or build a special method for the particular task at hand. The 
AMEBA libraries provide support for a temporal logic based production system, 
a temporal-modal semantic network, a back-propagation ANN, and a generic 
database lookup tool built on top of PostgreSQL. 

Etherons can be connected to each other in a lattice, like the neurons in a 
neural network; however, we have found it easier to model agents using a tree 
structure. While they are somewhat similar in design to an artificial neuron, 
etherons function more like a neural analyzer, a sub-network in the brain which 
serves a particular processing function HI], m- 

Intra-agent communication between etherons is defined using a set of stimulus 
messages. These messages are defined in such a way as to reasonably simulate 
the level of information being passed between real neural analyzers within a 
collection of neural stimuli. The inter-agent communication process is also based 
on message passing, but the architecture leaves the details of the message content 
up to the system designer. Both intra-agent and inter-agent messages can be 
routed via: 1) an etheron address, 2) local multicast or, 3) broadcast. However, 
messages can only cross between the system and agent domain via an interface 
etheron. 
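The three routing modes can be sketched as follows. This is a minimal, single-process Python illustration under stated assumptions: the `Router` class, its method names, and the group and address strings are invented for the example, and the interface-etheron restriction between system and agent domains is not modelled.

```python
class Router:
    """Hypothetical message router with address, multicast and broadcast delivery."""

    def __init__(self):
        self.inboxes = {}  # etheron address -> list of received messages
        self.groups = {}   # local multicast group -> set of addresses

    def register(self, address, group):
        self.inboxes[address] = []
        self.groups.setdefault(group, set()).add(address)

    def unicast(self, address, message):
        """Deliver to a single etheron address."""
        self.inboxes[address].append(message)

    def multicast(self, group, message):
        """Deliver to every etheron registered in one local group."""
        for address in self.groups.get(group, ()):
            self.inboxes[address].append(message)

    def broadcast(self, message):
        """Deliver to every registered etheron."""
        for inbox in self.inboxes.values():
            inbox.append(message)


router = Router()
router.register("student1.lexicon", "student1")
router.register("student1.vision", "student1")
router.register("student2.lexicon", "student2")

router.unicast("student1.lexicon", "word:ball")
router.multicast("student1", "attend")
router.broadcast("utterance:throw ball")

for address, inbox in sorted(router.inboxes.items()):
    print(address, inbox)
```

Each inbox records messages in delivery order, so the effect of the three modes can be read directly from the printed lists.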

4 The Computation Model 

The telegraphic speech stage of human language development normally occurs 
at about 24 months of age after the child has been using holophrastic (or single 
word) sentences for over six months. It is characterized by the absence of function 
words, tense and word agreement. In other words, it demonstrates only the 
simplest syntactic relationships between words in the utterance while providing 
the user with a semantically rich set of useful speech acts. This form of speech 
is useful to a child who can reasonably expect adult listeners to reinforce both 
the speech act and general language performance both directly and indirectly. 
Like all other stages, the telegraphic speech stage is learned within the context 
of the normal adult speech being used with and around the child. 

Our computational model attempts to capture the process of language under- 
standing and learning during the telegraphic speech stage of development. Using 
AMEBA, we have constructed a language processing environment consisting of 
three agents: a teacher agent and two student agents. The system broadcasts all
inter-agent messages to simulate language learning during immersion. 

The current teacher agent is made up of three etherons that provide a
text-based interface for entering adult level utterances and capturing the telegraphic
speech level utterances of the students. The student agents are far more complex.
As shown in Figure 2, they currently consist of 18 functioning etherons with
two additional etherons currently under development. While we have built other
cognitive models with AMEBA to explore other aspects of cognition outside the
scope of this language system, this system is our primary research focus and
is expected to continue growing as we explore non-language cognitive tasks in
more detail.

Currently, the students’ utterance processing elements are supported by a
simplified set of vision elements, allowing the study of visual symbol grounding. 
They are also supported by a set of Higher Order Process (HOP) elements used 
to drive the overall intentionality of the system and process the higher level 
language functions of discourse, conversation and social acts m, ca. Since a 
student agent must glean enough information out of an adult speech utterance 
to learn and use telegraphic speech, the majority of the language processing is 
dedicated to understanding. A major reason for picking the telegraphic speech 
level to study is that the agent has to do very little to the underlying semantic 
structure of an utterance to speak it. 

In keeping with most cognitive theories, the research system divides knowl- 
edge storage in the agent into Long Term Memory (LTM) and Short Term 
Memory (STM) storage jl], |1]. While strongly influenced by Baddeley
and Hitch’s working memory model, the system extends the concept of STM
past their phonological loop and visuo-spatial scratch-pad to provide a separate 
STM element in all three classifiers and the Semantic Reasoner. While the sys- 
tem uses some episodic knowledge to handle discourse, most knowledge stored 
in the system is either procedural or conceptual. 

Fig. 2. The component structure of one of the student agents

One original goal of AMEBA was to build a small set of generic reasoners
(e.g., production systems, semantic networks, etc.) which could be reused in a
number of different cognitive roles with little or no additional glue code. These
reasoners were expected to support both the STM and LTM aspects of knowledge 
storage in an etheron. Research with AMEBA has shown that the lower (closer to 
the external input/output) a function is, the harder it is to meet this goal. In the 
research system, we have been able to reuse a set of three reasoners (a production 
system, semantic network, and database based pattern matcher) for the Long 
Term Memory (LTM) storage in all etherons. While a STM library provides 
some common functionality, most STM support below the HOP processes has 
ended up as special purpose code for each element. 

The Vision, Lexicon, and Syntactic Classifiers use traditional SQL commands 
to retrieve and store LTM elements. The Vision Classifier is only a simulation 
of the expected output of a more realistic vision system to be built later. This 
simulation already seems to demonstrate the need for an explanatory STM repre- 
sentation which functions much like a conceptualized visuo-spatial scratch-pad. 
The Lexical Classifier uses a STM that closely resembles a phonological loop
except that it stores the input from all speakers without the attenuation that
some attention theories propose Pj. To date, we have been able to avoid most of the
complexity of attention and attention switching by doing no actual
phonological processing of the input language stream. At present, input is al- 
ready broken down into words by the sending agents. The network level API 
used to transmit these words forces the serialization of all messages.

The Syntactic Classifier uses a stored set of part-of-speech patterns to
generate a governor and government relationship for most non-function words in an
utterance. It is based on aspects of several linguistic theories, primarily Principles
and Parameters theory, and works by extracting positional information directly 
from the surface form of the utterance. Ignoring surface-to-deep structure move- 
ment (e.g., wh-movement, ?-movement, etc.) seems to fully support the level 
of understanding needed for telegraphic speech since we are for the most part 
ignoring the subtleties in meaning generated by the surface manifestation of a 
deep structure. 

The knowledge that the Syntactic Classifier uses is broken into legal patterns
of part-of-speech sequences and the government rules associated with these patterns.
While the classifier is capable of learning new rules, the pattern set must be
fixed. This at first generated some concerns that the mechanism being used was
not explanatory, but further analysis has revealed that most of these patterns
are most likely learned prior to the telegraphic speech stage using something
like the Language Acquisition Device (LAD) proposed by Government and Binding
theory.

The Semantic Reasoner uses AMEBA’s generic semantic network for its
LTM storage. The STM portion of this etheron creates and maintains an
activation list which supports such explanatory concepts as priming and recall. The
reasoner’s knowledge representation is capable of maintaining the level of belief
of the concepts and relationships, as well as the temporal and modal context
of a relationship. This allows this reasoner to store both language and world
knowledge within the same representation, including the beliefs of other agents.
While this unified conceptual representation works for the current system, it in- 
troduces some explanatory and scaling concerns which will need to be addressed 
in the future. 

All of the HOP etherons use AMEBA’s generic production system for
both LTM (rules) and STM (facts) storage. The rule set being used in the current
system is fairly basic and could reside in a single Knowledge Base (KB) system.
It has been divided in the current system to allow for growth and to study the
effect of parallel rule firing. Facts in the KBs are used to maintain the current
context of the inference and are not used for long-term storage of knowledge. Facts
are generated by internal assertion and external stimulation by other reasoners.
The KBs support a mechanism for natural decay of factual information.

Information is passed between etherons in an agent using stimulus messages.
Since the goal of an etheron network is to emulate a collection of neural analyzers,
stimulus messages have been kept as simple as possible. The basic structure of a
stimulus is:

    [x]name(parameter[0], parameter[1], ..., parameter[n])    (1)

where x is an optional modifier and name is a three-letter stimulus type name.
While the AMEBA communication API takes care of encoding and decoding the
parameter list, it enforces no semantic structure on the list. How these
parameters are used is defined by the sending and receiving etherons. For example, in the
current system, a TAG stimulus type is used to tag an input word with its speaker
and the order in which it is received, and a POS stimulus type is used to assign a
part-of-speech to a tagged word. All etherons in the language sub-network know
how to extract the information encoded in both of these stimulus types to do their
particular task. 

By using the modifier field of a stimulus type, additional semantics are added
to a stimulus type’s meaning. For example, an unmodified POS stimulus is always
interpreted as a positive assertion about the part-of-speech of a word. If an
etheron cannot absolutely assert the part-of-speech of a tagged word, but has
evidence that it might be some part-of-speech, it can send a *POS stimulus which
is interpreted as a positive hypothesis about the contents of the POS stimulus.
If it has evidence that a certain tagged word cannot be some part-of-speech, it
can send a !POS stimulus which is interpreted as a negative hypothesis about
the contents of the POS stimulus. A ?POS stimulus is used to ask a question
about the truth of the contents of a POS stimulus.
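
The stimulus format and its modifiers can be illustrated with a minimal Python sketch. This is not the AMEBA communication API: the names Stimulus, parse_stimulus and encode_stimulus are hypothetical, the parameter layout is invented for the example, and the use of "!" as the negative-hypothesis modifier character is an assumption about the original notation.

```python
import re
from dataclasses import dataclass
from typing import List

# Meaning attached to each modifier. The "!" character for a negative
# hypothesis is an assumption made for this sketch.
MODIFIER_MEANING = {
    "": "assertion",
    "*": "positive hypothesis",
    "!": "negative hypothesis",
    "?": "question",
}

# Optional one-character modifier, three-letter type name, parameter list.
STIMULUS_RE = re.compile(r"^([*!?]?)([A-Z]{3})\((.*)\)$")


@dataclass
class Stimulus:
    modifier: str
    name: str
    parameters: List[str]

    @property
    def meaning(self) -> str:
        return MODIFIER_MEANING[self.modifier]


def parse_stimulus(text: str) -> Stimulus:
    """Decode '[x]name(p0,p1,...)' without interpreting the parameters."""
    match = STIMULUS_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a stimulus: {text!r}")
    modifier, name, raw = match.groups()
    parameters = [p.strip() for p in raw.split(",")] if raw.strip() else []
    return Stimulus(modifier, name, parameters)


def encode_stimulus(stimulus: Stimulus) -> str:
    """Encode a stimulus back into its flat text form."""
    return f"{stimulus.modifier}{stimulus.name}({','.join(stimulus.parameters)})"
```

As in the description above, the parser enforces no semantics on the parameter list; a receiving etheron would decide how to read, say, the parameters of `parse_stimulus("*POS(ball, noun)")`.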

As words arrive at the Agent Interface, they are converted to TAG stimuli
and these stimuli are local multicast to all etherons directly connected to the
Utterance Stimuli Router. After the TAG message is sent, the resulting processing
becomes highly parallel. While the Lexical Classifier is trying to create a POS
stimulus for the word, the Syntactic Classifier is trying to construct sentences
out of the input words and relate these words to the part-of-speech information
retrieved by the Lexical Classifier. Even before the POS message gets to the
Syntactic Classifier so it can generate a SYN (syntax) stimulus to send to the
Semantic Reasoner, the Semantic Reasoner is activating the word nodes in its 
LTM and starting to generate the SEM (semantic) stimuli it will broadcast to 
the HOPs and vision system based on the total utterance. At the HOPs, these 
SEM stimuli are being matched with each other and the REF (reference) stimuli 
being generated by the vision sub-network to fire rules of discourse, socialization 
and planning which determine both the response needed and how this response 
should be carried out. 

Learning in our system is incorporated as part of each etheron’s normal
knowledge processing, and not as any separate or distinct machine learning
algorithm. Utterance level learning occurs when a higher level system is able to
feed back a hypothesis it used in place of the data it would normally receive
from an input level system. For example, both the Syntactic Classifier and
Semantic Reasoner use part-of-speech information to complete their task. If the
Lexical Classifier is unable to send a POS stimulus, the Syntactic Classifier and
Semantic Reasoner try to figure out the part-of-speech of a word by context.
These hypotheses are then fed back to the Lexical Classifier which either adds
a new part-of-speech record to its LTM or modifies its confidence in a record.
Records that fall below a certain confidence level are removed from the LTM
altogether. AMEBA also supports the dynamic modification of a production
system’s rules (i.e., live experts) which we will be using in the future for discourse
level learning.

5 Results

The current results fall into three areas: 1) the system’s performance on language
tasks requiring external and internal symbol grounding, 2) the system’s ability
to learn new language, and 3) what the system has demonstrated about the 
language theories being used. Since the processing used in both understanding 
and learning is complex, we can only briefly address these here. 

5.1 System’s Language Understanding Performance 

Currently the system demonstrates the ability to: 1) process and appropriately 
act upon adult level utterances, 2) resolve symbol grounding from visual infor- 
mation and stored conceptual information, and 3) engage in short dialogues re- 
quiring the understanding of adult level speech and the generation of telegraphic 
utterances. 

As a brief example of the system’s processing performance, take a very simple
dialogue spoken with the scene database containing the representation, item0
(the teacher) holds item4 (a ball) colored item? (blue):



Teacher: “Look at me! What is this?” (1a)
Student: “ball” (1b)
Teacher: “What color is it?” (1c)
Student: “blue” (1d)



To further simplify this example, we will assume that all utterance level etherons 
already have the knowledge to process this discourse so that no feed-back learning 
is required. 

The first part of 1a is just used to point the student’s attention to the teacher
so we will ignore it in this discussion. Given the second part of 1a, the Hearing
Interface receives and tags the words and the Lexicon Classifier creates three
part-of-speech stimuli for them. The Syntax Classifier uses this information to
look up a matching phrase pattern and generates two syntax stimuli that state:
‘in sentence 1, word 1 is the object and word 3 is the subject of word 2’. (Note
here that the wh-movement of this expression is built into the pattern and its
associated rules.)

Using the above information, the Semantics Reasoner first activates all
possible meanings of the three words, and then uses the syntactic information to
reduce the ‘is’ to the equative (or linking) form and the ‘this’ to a grounding
marker. It then generates two semantic stimuli that state: ‘in sentence 1, the
teacher requests the name of ground marker 1 and the teacher contains ground
marker 1’.

The semantic stimuli cause the Ego HOP to generate a stimulus stating ‘for
sentence 1, I want to answer’ since both students have rules which increase their
pleasure if they answer the teacher. The semantic stimuli cause the Social HOP
to generate a stimulus stating ‘for sentence 1, I should answer’ since the teacher
has not told them to attend and has not told them to be quiet. These stimuli
cause the Conversation HOP to generate a stimulus indicating the agent has the
turn. Using all of this, the Discourse HOP generates a reference question stimulus
asking: ‘regarding sentence 1, how can the statement, teacher contains ground
marker 1, be resolved’. This question is seen by both the Semantic Concept
Reasoner and the Vision Classifier, but only the Vision Classifier will provide useful
information. It does so by looking in the scene to see that the teacher holds
item4 and item4 is color item?. If it has a correct symbol name for item4, it
would return a reference stimulus that states ‘regarding sentence 1, the teacher
holds ball’.

Using all of this input, the Discourse HOP then generates a response stimulus
that states: ‘respond with ball’. It also internally stores ‘ball’ as the current
symbol ground of the current discourse. The response stimulus is currently used by
the Speaking Interface to respond to the question. In the near future, this stimulus
will be converted by the Surface Structure Generator into a more ‘grammatical’
response before being sent to the Speaking Interface.

The process for answering 1c is basically the same as for 1a except that the
stimuli from the Semantics Reasoner will indicate to the Discourse HOP that 
the referent is discourse related and it will substitute the word ‘ball’ for ‘it’ in the 
resulting reference question. Except that the Vision Classifier is already primed 
to answer this question, the rest of the process is also the same. 

As a brief example of the system’s concept memory use, let us look at the
following dialogue spoken with the scene database containing the representation,
item0 (the teacher) points-to item16 (a cow) in item10 (a picture):



Teacher: “Look at the picture! What is this?” (2a)
Student: “cow” (2b)
Teacher: “What sound does it make?” (2c)
Student: “cow moo” (2d)



The discourse up to 2c would follow the same processing path as discourse 1.
However, when the Discourse HOP asks for the sound of a cow, this question
would cause the Semantic Concept Reasoner to look for an intersection between
the concepts ‘cow’ and ‘sound’. It would then find and return an answer of ‘moo’
which would be processed exactly as if it came from the Vision Classifier. Since
the Semantic Concept Reasoner is aware of the Vision Classifier answers, it is
also primed to support questions like:

Teacher: “Look at the picture!” 

“What sound does this animal make?” (3) 
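
The intersection search between ‘cow’ and ‘sound’ can be pictured with a small Python sketch. It assumes the semantic network is a plain undirected adjacency map and uses breadth-first activation from both concepts; the function names, the toy network and the depth limit are all hypothetical, and AMEBA's actual semantic network (with belief, temporal and modal context) is considerably richer.

```python
from collections import deque


def reachable(graph, start, max_depth):
    """Breadth-first activation: map each reached node to its distance from start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not spread activation past the depth limit
        for neighbor in graph.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth


def intersect(graph, a, b, max_depth=2):
    """Return nodes activated from both a and b, closest intersections first."""
    from_a = reachable(graph, a, max_depth)
    from_b = reachable(graph, b, max_depth)
    shared = (set(from_a) & set(from_b)) - {a, b}
    return sorted(shared, key=lambda n: (from_a[n] + from_b[n], n))


# Toy network, invented for this example.
network = {
    "cow": ["animal", "moo"],
    "moo": ["cow", "sound"],
    "sound": ["moo", "bark"],
    "bark": ["sound", "dog"],
    "dog": ["bark", "animal"],
    "animal": ["cow", "dog"],
}

best = intersect(network, "cow", "sound")[0]
```

In this toy network the closest intersection of ‘cow’ and ‘sound’ is ‘moo’, which is the kind of answer the reasoner would hand back to the Discourse HOP.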



5.2 System’s Language Learning Performance 

In the example above, if the Vision Classifier has no reference for a visual item, 
no output will be generated but the HOPs will start listening for the other 
student or teacher to give the answer. Once the answer is confirmed by the 
teacher, the student updates its knowledge (i.e. learns) that the visual item has
that name. While this concept learning is very useful, the major focus of our
learning research is language, not concept, learning.

As a brief example of the system’s learning performance, take the input
utterances:



Teacher: “John hit the ball.” (4a)
Teacher: “The ball is blue.” (4b)
Teacher: “Mary rolls the blue ball.” (4c)



where the agent starts with no information about the word ‘ball’ and no asso- 
ciation of a syntax rule for the position filled by ‘ball’ in the third sentence. To 
further simplify this example, we will speed up the way feed-back learning is 
done since an agent would actually need over 30 sentences with the word
‘ball’ in them to completely learn its linguistic meaning. 

When the first sentence is received, the Hearing Interface would again tag the 
words and the Lexicon Classifier would create part-of-speech stimuli for all but 
the word ‘ball’. This would cause the Syntax Classifier to look up all patterns 
that match the phrase length and known part-of-speech slots and generate a 
positive hypothesis about the missing part-of-speech. This positive hypothesis 
would then cause the Lexicon Classifier to begin creating a belief that ‘ball’ is 
a noun. Given enough examples of ‘ball’ filling a noun position, the Lexicon 
Classifier’s belief grows to the point that it will assert that ‘ball’ is a noun when 
stimulated by a tag stimulus. 
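
The belief-growth behaviour described above can be sketched in Python as follows. This is an illustration only: the class name LexiconLTM, the fixed step size and both thresholds are assumptions chosen for the example (the real system needs many more sentences before it asserts a part-of-speech), and the actual Lexicon Classifier stores its records through SQL rather than in a dictionary.

```python
from typing import Dict, Optional, Tuple


class LexiconLTM:
    """Toy part-of-speech store keeping one belief value per (word, pos) record."""

    def __init__(self, step: float = 0.125, assert_at: float = 0.75,
                 drop_below: float = 0.125) -> None:
        self.step = step              # belief change per hypothesis (assumed)
        self.assert_at = assert_at    # belief needed to assert a POS (assumed)
        self.drop_below = drop_below  # records under this belief are removed
        self.records: Dict[Tuple[str, str], float] = {}

    def feedback(self, word: str, pos: str, positive: bool) -> None:
        """Apply a positive or negative hypothesis fed back by a higher level."""
        key = (word, pos)
        belief = self.records.get(key, 0.0)
        belief += self.step if positive else -self.step
        belief = min(1.0, belief)
        if belief < self.drop_below:
            self.records.pop(key, None)  # forget weakly supported records
        else:
            self.records[key] = belief

    def lookup(self, word: str) -> Optional[str]:
        """Return the POS to assert for a tagged word, or None if unsure."""
        strong = [(belief, pos) for (w, pos), belief in self.records.items()
                  if w == word and belief >= self.assert_at]
        return max(strong)[1] if strong else None


ltm = LexiconLTM()
for _ in range(5):
    ltm.feedback("ball", "noun", positive=True)
before = ltm.lookup("ball")   # belief is still below the assertion threshold
ltm.feedback("ball", "noun", positive=True)
after = ltm.lookup("ball")    # belief has now reached the threshold
```

With these assumed settings the lexicon stays silent about ‘ball’ for the first five hypotheses and asserts ‘noun’ after the sixth.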

Now that the part-of-speech of ‘ball’ is known, on receiving sentence two the
Semantic Reasoner will try to activate the noun ‘ball’ and fail because it does
not exist. This causes the Semantic Reasoner to create a new node for ‘ball-noun’
and link it, with a minimum level of belief, to the concept ‘physical-object’ since
it, like ‘physical-object’, is capable of having a color. As the concept ‘ball-noun’
is used in other contexts, additional semantic and conceptual links are formed.

We will summarize the way the Syntax Classifier learns a new rule since it
is a little more complex and is currently being redesigned. Since all of the
parts-of-speech for the third sentence are known, the Semantic Reasoner will get a
complete list of these, as well as any rules the Syntax Classifier can supply
about the pattern. The Semantic Reasoner then attempts to propose a syntax
rule for the position held by ‘ball’ based on its semantic and conceptual links. The
proposal, or proposals, are sent to the Syntax Classifier as positive hypotheses.
Since malformed rules can seriously impact the system’s future performance, the
Syntax Classifier does a bit of reasoning about the proposed rule before inserting
it. The stored (flat) surface pattern for sentence three is ‘NVdAN’ with
upper-case letters being open and lower-case being closed categories. In the current
stable system, the Syntax Classifier has a very simplistic understanding of how
government relations work (in English): it checks the proposed rule against the
simplest form of the base pattern ‘NVN’ by assuming that ‘dA’ is governed by
the last ‘N’. If this rule matches the one proposed, the rule is added with a
minimum level of belief. This non-tree approach has proven not to scale and is
being replaced with a method which relies directly on X-bar syntax.

5.3 System’s Language Theory Feedback

An early result of the research demonstrated that the current system design needs
a syntactic word loop to function. This would seem to validate Baddeley and
Hitch’s working memory model although it is not entirely clear how our
current HOPs relate to their Central Executive. What is more surprising is how
well the implemented Principles and Parameters theory components are handling
language learning. Almost every problem we have encountered with the current
method of language learning has a solution which is directly suggested by
Principles and Parameters theory. For example, we were having problems learning
different categories for the same surface form until we started applying a
simple level of X-bar syntax, and as stated above, we are now moving toward a
complete implementation of X-bar syntax to improve syntax learning. With the
further addition of movement rules and recursive government relationships, it is
expected that the system will, in the future, be able to understand and generate
most adult speech.

The system has also demonstrated the need, as Jackendoff suggests, for a
unified conceptual reasoning which he proposes to take place below the language
level [8]. However, our research has demonstrated that you must also do some
level of reasoning above this level. We currently have to rely on too much world
knowledge during semantic processing to completely separate the semantic and
conceptual reasoning. The system also does not yet fully demonstrate the
concept that conceptual learning relies on the processing of episodic memories or
that procedural learning relies on the collection of concepts and simple rules as
proposed by Anderson and others. But these limitations seem to be caused by
the current scope of our evaluation system and not these theories themselves.
Overall, it appears that our system is proving to be a valid platform for
testing human performance and introspection theories.

6 Future Work 

In the next phase of the research we need to better understand and deal with 
the relationship between language and world knowledge in the system and to 
demonstrate learning at the discourse and socialization level. We are also starting 
to look at vision processing in more detail, as well as language generation.
To further our understanding of language processing and development, we are 
looking at a study of both the holophrastic and adolescent development stages 
of human language. Beyond these, we also want to improve the system’s ability 
to reason and plan with world knowledge, control actuators and support a real 
vision system. 

AMEBA has proven to be a powerful and flexible tool, but the current
system is far from being ready to simply plug into a real-world application. While
non-determinism was a goal in our parallel design, we seem to be getting a little
more than we bargained for in the current system. We will also need to be able to
better characterize how the system will perform each time it is operated before
we can propose its use in a production grade AI system.

7 Conclusion

Working with the existing system has been both exciting and daunting. We have 
been able to demonstrate that human performance and introspection theories can 
be directly applied to unified cognitive processing. The current system, while still 
very young, has pointed to the validity of linguistic and experimental psychology
theories which are often ignored in computational research because of their non- 
formal nature. However, it is clear that our research has only scratched the 
surface of this line of experimentation and must still prove its merit against far 
more mature lines of research. Our research is best viewed, not as a solution to 
any grand challenge in Artificial Intelligence or Cognitive Science, but as a point 
of reflection along the road to a unified model. It is hoped that this paper will 
stimulate a discussion of both its merits and oversights.

Abstract. Electronic commerce websites often have trouble keeping up
with the large amount of customer-service related email they receive.
One way to alleviate the problem is to automate responding to that 
email as much as possible. Many customer messages are in essence fre- 
quently asked questions, for which it is easy to provide a reply. This 
paper explores a staged approach to message understanding: an incom- 
ing message is first classified in a specific category. If the category of 
the message corresponds to a specific frequently asked question, the an- 
swer is provided to the customer. If the category corresponds to a more 
complex question, a finer understanding of the message is attempted. 
Messages are categorized by a combination of a Bayes classifier and regular
expressions, which significantly improves performance compared to a
simple Bayes classifier. A first version of the system is installed on the
FTD website (Florist Transworld Delivery). It can classify more than half 
of the customer messages, with 2.3% error; three quarters of the catego- 
rized messages are frequently asked questions, and receive an automatic 
response. 



1 Introduction 

Recent studies by several media research companies have underlined the poor 
performance of many electronic commerce websites in terms of customer service. 
A study of 125 “top websites” by Jupiter Communications [2] indicates that 42% 
either never responded to their customers’ needs, or took more than 5 days to 
respond, or did not offer email options to their customers. In another study of the
100 largest companies’ websites, Brightware [1] found that only 15% answered a
simple email query (“what is your headquarters address?”) within three hours; 
36% could not be emailed from their website; 10% never answered. 

Both studies indicate that about 50% of the websites fail to provide
satisfying customer service. The most likely reason for this high failure rate is that
many customer service departments are not ready to deal with unexpectedly
high quantities of customer email. As the number of internet users and on-line
buyers continues to grow, e-commerce companies have to take action to solve
this customer service problem.

Currently, there are several types of tools available to help companies deal
with customer service interaction. Some tools make it possible for a customer 
to talk live to a customer service representative over the web; other tools help a 
pool of representatives deal with incoming email (by offering a centralized queue, 
pre-defined responses for frequently asked questions, and monitoring facilities). 
These tools are mostly ports to the internet of phone-based customer service 
technologies. While they may be useful in improving the customer service effi- 
ciency for the company, and experience for the customer, they fail to take into 
account a major aspect of the internet: the possibility of automation. 

Because customer input over the internet comes as text instead of voice, the 
interaction between customer service and customer can be partially automated. 
Automation will help the retail company decrease its customer service costs, 
and will also improve customer experience by providing immediate response. 
The response will be immediate even in burst demand situations, such as the 
demand for flowers prior to Mother’s Day or toys prior to Christmas. Because
response will be provided by a program, it will also always be consistent: cus- 
tomers asking identical questions will receive the same reply. A well designed 
system will give only relevant feedback (when the customer input is understood 
with a high enough degree of certainty) and refer to a human representative in 
case of uncertainty. 

In this article, we present the first version of the automated customer service
software Interact. Section 2 addresses the architecture of the software; Section 3
presents the text classification technology; a later section discusses the implementation
of Interact on the Florist Transworld Delivery e-commerce website, www.ftd.com.



2 The Interact Staged Approach 

2.1 Different Types of Customer Messages 

Incoming customer service messages can be divided into two types: messages 
that can be answered by a pre-defined reply (called type I), and messages that 
need a specific answer (called type II).

Examples of type I messages are frequently asked questions like the ones found 
on the large number of FAQ-lists available on the internet: for example “Do you 
deliver on Sundays?”, or “What is your return policy?”. Comments (e.g. “Your 
site is great”) or advice (e.g. “You should have more choice”) are also type I 
messages. These messages form a large portion of the incoming customer email. 
For example, over 75% of the order-form suggestions on the FTD website are of 
type I (measured on 6000 messages). Automating only the answering of type I 
messages would therefore be a major benefit. 

Type II messages need to be answered with a specific answer, depending on 
each message. Examples of such messages may be “I forgot my password, please 
send it back to me”, or “Give me my AAA 10% discount”, or “Is that shirt 
available in blue?”.

2.2 Categories of Messages

Incoming messages can be grouped by categories corresponding to the topics 
they deal with; for example messages like “great site”, “I like your site”, “keep 
it up” will be grouped in the category compliment. Other examples of categories 
could be have more choice, password problem, discount request. Each application 
of Interact to a specific website will have a specific set of categories, although 
evidence suggests that a large number of categories (e.g. the ones mentioned
above) are common to many electronic commerce applications. We are currently 
working on a customer-service message ontology, that will help us define cate- 
gories for new applications. 

Some of these categories correspond to type I messages (e.g. compliment, have
more choice, lower your prices...) and others correspond to type II messages (e.g.
password problem, discount request, question about product...).

2.3 The Interact Architecture 

When a new message comes in, Interact’s first operation is to classify the message
into a specific category. Knowing the category of a message provides only a crude 
understanding of the message, but in many cases this is enough to answer the 
message: if the message falls into a type I category, then Interact simply sends 
the corresponding pre-defined reply back to the customer. 

If the message falls into a type II category, it is necessary to extract more 
information from the message before being able to answer it meaningfully. For 
example, if a message falling in the discount request category says “please give 
me the 10% AAA discount, my membership number is XXX”, the system needs 
to extract the type of discount (“AAA”), the value (“10%”) and the member- 
ship number (“XXX”). When these data are retrieved, the system can look up 
databases and business rules, decide whether the discount is applicable or not, 
and use a reply template to compose the appropriate answer. 

If the information is not present in the message, or if it can’t be retrieved, then 
the system needs to collect that information from the customer. For example, if 
the incoming message says “please give me the 10% AAA discount”, the system 
needs to answer “Please provide us with your AAA membership number” to the 
customer; in brief, the system must start a conversation with the customer. 

These three possibilities (message is of type I; message is of type II and all
necessary information is present; message is of type II and some necessary
information is missing) dictate the Interact architecture shown in Figure 1. The
three message processing modules (classification, information extraction and
conversation management) deal with messages of growing complexity; each one is
called only if the previous one could not produce a meaningful answer
to the message.
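
The staged control flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Reply` type, the two lookup tables, and the `classify` and `extract` callables are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    text: str
    stage: str  # which module produced the reply


# Hypothetical tables; each deployment would define its own categories.
TYPE_I_REPLIES = {"compliment": "Thank you for your kind words!"}
TYPE_II_REQUIRED = {"discount request": ["discount_type", "membership_number"]}


def handle(message: str,
           classify: Callable[[str], Optional[str]],
           extract: Callable[[str, str], dict]) -> Optional[Reply]:
    """Run the three stages in order; None means 'forward to a human'."""
    category = classify(message)
    if category is None:
        return None  # unknown message
    if category in TYPE_I_REPLIES:
        # Type I: the category alone is enough to answer.
        return Reply(TYPE_I_REPLIES[category], "classification")
    # Type II: more information is needed before answering.
    fields = extract(message, category)
    missing = [f for f in TYPE_II_REQUIRED.get(category, []) if f not in fields]
    if not missing:
        return Reply(f"Request processed with {fields}", "information extraction")
    # Some information is absent: start a conversation with the customer.
    return Reply("Please provide: " + ", ".join(missing), "conversation management")
```

Each later stage runs only when the earlier one cannot produce an answer, which mirrors the ordering of the three modules.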

In the current version of Interact, only the classification module is
implemented; therefore only type I messages receive a reply. Type II and unknown
messages are forwarded to a customer service representative (type II messages
can be routed to specific representatives according to their category). In the next
section, we examine the classification module in more detail.

Fig. 1. Automated customer service architecture



3 Classification Technology 

Interact’s classification technology is based on another duality observed in the
message categories: some categories, such as compliment, can be expressed by
a wide variety of words and phrases. In contrast, other categories, such
as discount request or send catalog, concern precise topics and often contain a
few specific words. This duality prompted the two-level classifier architecture we
have defined, which combines two technologies: naive Bayes classification and
regular expressions.

3.1 Naive Bayes Classification 

Naive Bayes classification is considered one of the most efficient means of text
classification (see [3], p. 180). Indeed, it proved to be the best for our application
compared to other techniques we experimented with, for example ensembles of
oblique decision trees and least squares fit mapping. This section briefly
introduces the naive Bayes classifier and the threshold computation mechanism
we added to it.

Naive Bayes Algorithm. A naive Bayes classifier determines the category c of
a document composed of n words $w_1, w_2, \ldots, w_n$ (in no particular order)
according to the following equation:

$$c = \arg\max_{c_j \in C} \; P(c_j) \prod_{i=1}^{n} P(w_i \mid c_j) \qquad (1)$$

where C is the set of categories, $P(c_j)$ is the probability of the category $c_j$
and $P(w_i \mid c_j)$ the probability that the word $w_i$ appears in documents of
category $c_j$. These values can be estimated from a training set composed of a
series of documents categorized by a human. When a message is classified, the Bayes
classifier returns both the category and the probability that the message belongs
to that category (see [4] for more details).
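
A minimal sketch of Equation (1) is given below. It works in log space to avoid numeric underflow and uses add-one (Laplace) smoothing; both choices are assumptions made here, since the text does not say how the probabilities are estimated. The function names `train` and `classify` are illustrative.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (list_of_words, category) pairs labelled by a human."""
    cat_counts = Counter(category for _, category in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, category in docs:
        word_counts[category].update(words)
        vocab.update(words)
    return cat_counts, word_counts, vocab


def classify(words, model):
    """Return (best category, normalized probability) following Equation (1)."""
    cat_counts, word_counts, vocab = model
    total_docs = sum(cat_counts.values())
    log_scores = {}
    for category, n_docs in cat_counts.items():
        denom = sum(word_counts[category].values()) + len(vocab)
        score = math.log(n_docs / total_docs)  # log P(c_j)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                # log P(w_i | c_j) with add-one smoothing
                score += math.log((word_counts[category][w] + 1) / denom)
        log_scores[category] = score
    # Normalize the scores so that they sum to 1 over all categories.
    top = max(log_scores.values())
    z = sum(math.exp(s - top) for s in log_scores.values())
    best = max(log_scores, key=log_scores.get)
    return best, math.exp(log_scores[best] - top) / z
```

The returned probability is the quantity that a threshold can later be compared against.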



Threshold Learning A crucial aspect of our application is that the rate of 
false positives (messages classified in the wrong category) must be very low, 
even if that means fewer messages will be classified at all; giving the customer 
irrelevant feedback is worse than giving no feedback at all. In conventional text
classification terms [5], precision is more important for us than recall.

In order to enforce a specific maximum rate of false positives, a threshold 
can be set so that the Bayes classifier’s output is accepted only if its probability 
is high enough. 

The simplest way to compute the threshold is as follows: to begin, the maxi- 
mum rate of admissible false positives is chosen by the administrator of Interact 
(e.g., 3%). Next, the threshold is set to zero (meaning the classification proposed 
by the classifier will always be accepted). Then the classifier is tested on a test 
corpus, and the rate of false positives on this corpus is computed. If this rate is 
too high, the value of the threshold is raised by a fixed, small increment, and 
the test is run again. The threshold is set to the desired value when the rate of 
false positives produced by the classifier on the test set is below the maximum 
rate. 
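
The simple procedure above can be sketched as follows, assuming the classifier has already been run on the test corpus. The tuple layout `(predicted, probability, true_label)` and the 0.01 increment are assumptions for illustration; `None` stands for a message that the human left unclassified.

```python
def false_positive_rate(predictions, threshold):
    """Share of test messages accepted by the classifier in a wrong category."""
    wrong = sum(1 for predicted, prob, truth in predictions
                if prob >= threshold and predicted != truth)
    return wrong / len(predictions)


def learn_global_threshold(predictions, max_fp_rate=0.03, step=0.01):
    """Raise the threshold from zero until the false positive rate is acceptable."""
    n_steps = round(1 / step)
    for k in range(n_steps + 2):
        threshold = k * step  # multiples of step avoid accumulated float error
        if false_positive_rate(predictions, threshold) <= max_fp_rate:
            return threshold
    return threshold
```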

This threshold computation method has one drawback: it is effective only if
the messages are equally distributed among the categories. The probability that
an example is in a specific category is limited by the frequency of that category
in the training set (see Equation 1). Therefore, if a message belongs to a rare
category, the probability associated with its classification will be low, and possibly
lower than the global threshold computed by the simple method. To alleviate this
problem, we refined the threshold computation method by defining one threshold
per category rather than a unique threshold; the multiple threshold computation
algorithm is similar to the simple one, except that the threshold for a specific
category is raised as long as too many false positives are produced in this specific
category. The improvement obtained with this method compared to the simple
method is illustrated in Section 4.4.
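
A possible reading of the per-category variant is sketched below. The text does not say over which set of messages the per-category false positive rate is measured; this sketch assumes it is the set of test messages that the classifier assigns to that category. The tuple layout `(predicted, probability, true_label)` is also an assumption.

```python
def learn_category_thresholds(predictions, max_fp_rate=0.03, step=0.01):
    """Return one acceptance threshold per predicted category."""
    n_steps = round(1 / step)
    thresholds = {}
    for category in {predicted for predicted, _, _ in predictions}:
        assigned = [(prob, truth) for predicted, prob, truth in predictions
                    if predicted == category]
        for k in range(n_steps + 2):
            threshold = k * step
            wrong = sum(1 for prob, truth in assigned
                        if prob >= threshold and truth != category)
            # Assumed definition: errors relative to messages assigned here.
            if wrong / len(assigned) <= max_fp_rate:
                break
        thresholds[category] = threshold
    return thresholds
```

With this scheme, a rare category whose classifications are correct but carry low probabilities keeps a low threshold, instead of being cut off by a threshold driven by errors in other categories.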

3.2 Regular Expressions 

Regular expressions can detect specific patterns in a sentence; for example, the
expression send[^\.]*catalog will match messages containing the word “send”,
followed by the word “catalog” but with no period between them (i.e. “send” and
“catalog” are in the same sentence). Unlike simple keywords, regular expressions
can take into account word order, word combinations, synonyms, punctuation,
etc. By examining a set of messages of a given category and looking for common
patterns, the administrator can design one or more regular expressions for that
category.
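
The send/catalog pattern described above can be tried out with Python's `re` module. The case-insensitive flag and the helper name `matches_send_catalog` are assumptions added for illustration.

```python
import re

# "send", then any run of characters other than a period, then "catalog".
SEND_CATALOG = re.compile(r"send[^.]*catalog", re.IGNORECASE)


def matches_send_catalog(message: str) -> bool:
    """True if 'send' is followed by 'catalog' with no period in between."""
    return SEND_CATALOG.search(message) is not None
```

A message such as "Please send me your catalog" matches, whereas "I will send flowers. Nice catalog" does not, because a period separates the two words.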

Regular expressions are well suited to specific messages that tend to be
expressed in a limited number of ways; for example customers asking to be
sent the catalog of the company. The administrator must make sure, during the
regular expression development process, that the proposed expressions do not
generate many false positives. This is achieved by defining very specific regular
expressions, like the one shown above, that are not likely to match unwanted
messages. This would be much more difficult to achieve using only keywords.
The improvement obtained by adding regular expressions to the plain Bayes
classifier is discussed in Section 4.4.



3.3 Interact’s Classifier 

Interact’s classifier is depicted in Figure 2. An incoming message is first classified
by the Bayes classifier. If no category is recognized, the message is classified by 
the regular expressions. We chose this architecture because regular expressions 
do not have the threshold “safety mechanism”, so we trust the Bayes classifier 
more, and because the Bayes classifier can identify a larger number of messages 
than regular expressions. Performance of the classifier on a specific application 
is given in the next section. 
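
The two-level arrangement can be sketched as below. The callable `bayes`, the per-category `thresholds` dictionary, and the `patterns` dictionary are hypothetical stand-ins; the default of 1.0 for a category without a learned threshold is also an assumption.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def two_level_classify(message: str,
                       bayes: Callable[[str], Tuple[str, float]],
                       thresholds: Dict[str, float],
                       patterns: Dict[str, List[re.Pattern]]) -> Optional[str]:
    """Try the thresholded Bayes classifier first, then the regular expressions."""
    category, prob = bayes(message)
    if prob >= thresholds.get(category, 1.0):
        return category  # the Bayes output is trusted above its threshold
    for candidate, regexes in patterns.items():
        if any(r.search(message) for r in regexes):
            return candidate  # fall back to hand-crafted patterns
    return None  # not classified: forward to a representative
```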
Fig. 2. Interact classifier

4 Interact on the FTD Website

4.1 Context 

A first version of Interact, which handles only type I messages, is installed on
the www.ftd.com website (Florist Transworld Delivery). On the FTD website,
customers can leave a message (in the form of free text) in the suggestion field
on the order form (see Figure 3).



[Figure 3: browser screenshot of the “Your Comments” part of the FTD order form,
showing the question “How did you hear about ftd.com?” with its answer options
(Television ad, FTD catalog, FTD internet ad, Search engine, Print ad, Other)
and the free-text field “Any suggestions?”]



Fig. 3. Portion of the FTD order form 



4.2 Categories of Messages 

Interact currently classifies messages in one of 31 different categories.
Approximately 75% of the classified messages fall into a type I category; Interact
answers them by writing a pre-defined message on the order confirmation form, shown
to the customer after the order has been recorded. Table 1 provides some examples
of type I categories, customer messages and the corresponding pre-defined
replies.



4.3 Classification Performance 

In this specific application, every customer is prompted for suggestions, and
many customers explicitly say they do not have any. These “negative” messages
don’t need a reply, and are easy to classify. In contrast, we wish to give
a reply to every positive message (actual suggestion or comment). We give the
performance measurements of Interact both on the whole set of messages, and
on the set of positive messages only.

Table 1. Examples of type I categories, with actual customer comments and
pre-defined replies on the FTD website. Underlined text denotes a web link.

Category: please confirm delivery
Customer comments:
• Send back confirmation of delivery and time of delivery on email.
• Can someone please send me an e-mail confirming reciept of this order Thanks
• e-mail verification of my order would be appreciated.
Pre-defined reply: We appreciate your comments. At this time, we do not
automatically confirm delivery of orders. If you have concerns about your order,
please complete our Order Inquiries Form.

Category: lower prices
Customer comments:
• Lower the price of Roses. They are good but not that good.
• I think you could offer lower prices for local deliveries.
• Pretty expensive-lower your prices.
Pre-defined reply: We appreciate your comments. FTD makes every effort to keep
our online prices competitive and to offer fresh beautiful flowers at market
value. In a recent review, we found that our prices for the same and similar
products were less than or equal to those of our major competitors on the Web.

Category: message box is too short
Customer comments:
• Increase the number of characters you can type for the message...it is way
  too short!
• Enable messages of greater than 150 characters.
• The message box doesn’t allow for enough words. You need more room for a
  ’personalize’ message.
Pre-defined reply: We appreciate your comments. Our florists’ gift cards are not
much larger than a standard business card. Your message will be hand-written
on this card. For this reason, we need to limit the number of characters
in your message. The next time you visit our site, you may try our
Quotable Sentiments (sm) library for a message that will easily fit on
the gift card.

The Bayes classifier was trained on 6000 customer messages received consec- 
utively and classified by a human. Multiple thresholds were then computed to 
limit the false positive rate to 3%. The classifier uses 25 regular expressions that 
were hand-crafted by looking for patterns in the same set of 6000 messages. The 
performance figures were measured on a test corpus of 795 messages classified 
by a human; all the messages were previously unseen by the Bayes classifier, and 
the messages were not used to help craft the regular expressions. Out of those 
795 messages, 600 were positive. 

Interact classified 63% of the whole set of messages in one of 31 different
categories with a false positive rate of 2%. Interact classified 51.3% of the positive
messages in 30 different categories (the previous ones except the no comment
category) with a false positive rate of 2.3%. Figure 4 shows the following measures
for the all-messages and positive-messages-only cases:



[Figure 4: bar chart comparing the “All messages” and “Positive messages” cases
on hits, correct rejections, misses, false positives and accuracy.]



Fig. 4. Performance (in percent) of the Interact classifier 



— Hits: percentage of messages that were classified in the same category by the 
human and the classifier. 

— Correct rejections: percentage of messages that were labelled as “not classi- 
fied” by the human and the classifier. 
— Misses: percentage of messages that were classified in a category by the
human, but labelled “not classified” by the classifier. 

— False positives: percentage of messages that were classified in two different 
categories by the human and the classifier. 

— Accuracy: hits + correct rejections: percentage of messages on which the 
classifier and the human agreed. 
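
The five measures listed above can be computed from pairs of human and classifier labels as in the following sketch. `None` stands for "not classified". Counting a message that the human left unclassified but the classifier assigned to a category as a false positive is an assumption made here so that the four basic measures sum to 100%.

```python
def evaluate(pairs):
    """pairs: list of (human_label, classifier_label); returns percentages."""
    counts = {"hits": 0, "correct_rejections": 0, "misses": 0, "false_positives": 0}
    for human, machine in pairs:
        if human is None and machine is None:
            counts["correct_rejections"] += 1
        elif machine is None:
            counts["misses"] += 1          # human classified it, classifier did not
        elif human == machine:
            counts["hits"] += 1
        else:
            counts["false_positives"] += 1  # classifier chose a different category
    result = {name: 100.0 * n / len(pairs) for name, n in counts.items()}
    result["accuracy"] = result["hits"] + result["correct_rejections"]
    return result
```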

4.4 Comparison with Partial Classifier Performances 

To illustrate the benefit of adding the multiple threshold mechanism and the
regular expressions to the basic Bayes classifier, Table 2 gives the performance
of the classifier in four different cases:

— Bayes classifier only, with one global threshold (basic Bayes classifier, B1T)

— Bayes classifier only, with multiple thresholds (BMT)

— Bayes classifier plus regular expressions, one global threshold (B1T+R)

— Bayes classifier plus regular expressions, multiple thresholds (Interact
classifier, same as Section 4.3, BMT+R)



Table 2. Performance (in percent) of different versions of the classifier, with
one or multiple thresholds (B1T and BMT), with or without regular expressions (R).

                      B1T    BMT    B1T+R  BMT+R
Hits                  41.1   46.3   48.6   51.3
Correct rejections    20.3   20.3   20.1   20.1
Misses                36.3   31.1   29.1   26.4
False positives        2.1    2.1    2.3    2.3
Accuracy              61.4   66.6   68.7   71.4



In each case, the Bayes classifier was trained with the same 6000 messages
as in Section 4.3 and, when used, the regular expressions were the same 25 as
in Section 4.3. In each case, the performance figures were computed on the same
set of 600 positive messages as in Section 4.3.

Table 2 shows that the two modifications, multiple thresholds rather than a
single threshold (BMT) and the addition of regular expressions (B1T+R), each improve
the hit rate of the basic Bayes classifier (B1T). The combination of multiple
thresholds and regular expressions (BMT+R) also improves on the hit rate of each
of them (BMT and B1T+R) separately.

In the Interact classifier case (BMT+R), the Bayes classifier is responsible
for 90% of the hits and 92% of the false positives, and the regular expressions are
responsible for the remaining 10% of the hits (5 percentage points) and 8% of the
false positives (0.2 percentage points). This shows the validity of our classifier
compared to a simple Bayes classifier: the addition of regular expressions to
the Bayes classifier significantly improves the hit rate, while degrading the false
positive rate only slightly.

5 Conclusion and Work in Progress

The Interact system illustrates the benefits of a staged approach to natural lan- 
guage processing. A relatively straightforward technique to set up, classification, 
often gives enough information on messages to let the system answer them. If 
more information is needed, other natural language understanding techniques 
can be applied. 

The installation of the first version of Interact on FTD offers three important 
benefits: 

— Providing customers with feedback and answering their concerns immediately
when possible enhances their shopping experience and demonstrates their
value to FTD.

— Interact reduces the load of the customer service representatives who take
care of the order-form suggestions by a factor of about two; this is especially
important in burst-demand situations, to help the company remain
responsive.

— Interact keeps statistics on the number of suggestions in each category and 
their evolution, providing FTD with valuable customer feedback in a sum- 
marized and easy to understand form. 

A web-based interface has been designed that lets the administrator define
regular expressions, build test and training corpora, train and test the system,
and define the categories and the automated responses.

We are currently working on the information extraction and conversation 
management modules of Interact, to enable it to handle messages that cannot 
be answered by a pre-defined reply. We are also continuing work on the classi- 
fier, since this module is at the root of the system; we are in particular exploring 
methods to help the administrator define the regular expressions, or to automat- 
ically build them. 

Our longer term plans include applying artificial intelligence technologies to
more aspects of electronic commerce. Future versions of Interact will be pervasive
throughout a website, providing each customer with a personalized interaction
based on the customer profile, his or her previous interactions with the company,
and the business process in which he or she is currently involved.

Abstract. We describe an ontology engineering methodology by which conceptual
knowledge is extracted from an informal medical thesaurus (UMLS) and
automatically converted into a formally sound description logics system. Our
approach consists of four steps: concept definitions are automatically generated
from the UMLS source, integrity checking of taxonomic and partonomic hierarchies
is performed by the terminological classifier, cycles and inconsistencies are
eliminated, and incremental refinement of the evolving knowledge base is
performed by a domain expert. We report on knowledge engineering experiments
with a terminological knowledge base composed of 164,000 concepts and 76,000
relations.



1 Introduction 

Over several decades, an enormous body of medical knowledge, e.g., disease taxonomies,
medical procedures, anatomical terms etc., has been assembled in a wide variety of 
medical terminologies, thesauri and classification systems. The conceptual structuring 
of a domain they provide is typically restricted to broader/narrower terms, related terms 
or (quasi-)synonymous terms. This is most evident in the UMLS, the Unified Medical 
Language System [8], an umbrella system which covers more than 50 medical thesauri 
and classifications. Its metathesaurus component contains more than 600,000 concepts 
which are structured in hierarchies by 134 semantic types and 54 relations between 
semantic types. Their semantics is shallow and entirely intuitive, which is due to the 
fact that their usage was primarily intended for humans. As one of the most compre- 
hensive sources of medical terminologies it is currently applied for various forms of 
clinical knowledge management, e.g., cross-mapping between different terminologies 
and medical information retrieval. 

Given its size, evolutionary diversity and inherent heterogeneity, there is no sur- 
prise that the lack of a formal semantic foundation leads to inconsistencies, circular 
definitions, etc. [2]. This may not cause utterly severe problems when humans are in 
the loop and its use is limited to disease encoding, accountancy or document retrieval 
tasks. However, anticipating its use for more knowledge-intensive applications such as 
natural language understanding of medical narratives, medical decision support sys- 
tems, etc., those shortcomings might lead to an impasse. Reuse of those rich and large 
resources then requires the elimination of their weaknesses. 



H. Hamilton and Q. Yang (Eds.): Canadian AI 2000, LNAI 1822, pp. 176-186, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 

As a consequence, formal models for dealing with medical knowledge have been
proposed, using representation mechanisms based on conceptual graphs, semantic net- 
works or description logics [14, 7, 10]. Not surprisingly, however, there is a price to be 
paid for gains in expressiveness and formal rigor, viz. increasing modeling efforts and, 
hence, increasing maintenance costs. Therefore, concrete systems making full use of 
this rigid approach, especially those which employ high-end knowledge representation 
languages are usually restricted to rather small subdomains. The limited coverage then 
restricts their routine usage, an issue which is always highly rewarded in the medical 
informatics community. 

The knowledge bases developed within the framework of the above-mentioned ter- 
minological systems have all been designed from scratch - without making systematic 
use of the large body of knowledge contained in those medical terminologies. An in- 
triguing approach would be to join the massive coverage offered by informal medical 
terminologies with the high level of expressiveness supported by knowledge represen- 
tation and inferencing systems in order to develop formally solid medical knowledge 
bases on a larger scale. This idea has already been fostered by Pisanelli et al. [9] who 
extracted knowledge from the UMLS semantic network as well as from parts of the 
metathesaurus and merged them with logic-based top-level ontologies from various 
sources. In a similar way, Spackman and Campbell [13] describe how SNOMED [3] 
evolves from a multi-axial coding system into a formally founded ontology. Unfortu- 
nately, the efforts made so far are entirely focused on generalization-based reasoning 
along is-a hierarchies and lack a reasonable coverage of partonomies. 



2 Part-Whole Reasoning

As far as medical knowledge is concerned, two main hierarchy-building relationships 
can be identified, viz. is-a (taxonomic) and part-whole (partonomic) relations. Unlike 
generalization-based reasoning in concept taxonomies, no fully conclusive mechanism 
exists up to now for reasoning along part-whole hierarchies in description logic sys- 
tems. For medical domains, however, the exclusion of part-whole reasoning is far from 
adequate. Anatomical knowledge, a central portion of medical knowledge, is princi- 
pally organized along part-whole hierarchies. Hence, any proper medical knowledge 
representation has to take account of both hierarchy types [5]. 




Fig. 1. SEP Triplets: Partitive Relations within Taxonomies 

Various approaches to the reconstruction of part-whole reasoning within
object-centered representation approaches are discussed by Artale et al. [1]. In the
description logics community several language extensions have been proposed which
provide special constructors for part-whole reasoning [10, 6], though at the cost of
increasing computational complexity. Motivated by informal approaches sketched by
Schmolze and Marks [12] we formalized a model of part-whole reasoning [4] that does
not exceed the expressiveness of the well-understood, parsimonious concept language
ALC [11].*

Our proposal is centered around a particular data structure, so-called SEP triplets, 
especially designed for part-whole reasoning (cf. Figure 1). They define a characteristic 
pattern of IS-A hierarchies which support the emulation of inferences typical of transi- 
tive PART-OF relations. In this formalism, the relation ANATOMICAL-PART-OF describes 
the partitive relation between physical parts of an organism. 

A triplet consists, first of all, of a composite ‘structure’ concept, the so-called S- 
node (e.g., Hand-Structure). Each ‘structure’ concept subsumes both an anatomi- 
cal entity and each of the anatomical parts of this entity. Unlike entities and their parts, 
‘structures’ have no physical correlate in the real world — they constitute a represen- 
tational artifact required for the formal reconstruction of systematic patterns of part- 
whole reasoning. The two direct subsumees of an S-node are the corresponding E-node 
(‘entity’) and P-node (‘part’), e.g., Hand and Hand-Part, respectively. Unlike an
S-node, these nodes refer to specific ontological objects. The E-node denotes the whole 
anatomical entity to be modeled, whereas the P-node is the common subsumer of any of 
the parts of the E-node. Hence, for every P-node there exists a corresponding E-node for 
the role ANATOMICAL-PART-OF. Note that E-nodes and P-nodes are mutually disjoint, 
i.e., no anatomical concept can be ANATOMICAL-PART-OF itself. 

The reconstruction of the relation ANATOMICAL-PART-OF by taxonomic reasoning
proceeds as follows. Let us assume that $C_E$ and $D_E$ denote E-nodes, $C_S$ and
$D_S$ denote the S-nodes that subsume $C_E$ and $D_E$, respectively, and $C_P$ and
$D_P$ denote the P-nodes related to $C_E$ and $D_E$, respectively, via the role
ANATOMICAL-PART-OF (cf. Figure 1). These conventions can be captured by the
following terminological expressions:

$$C_E \sqsubseteq C_S \sqsubseteq D_P \sqsubseteq D_S \qquad (1)$$

$$D_E \sqsubseteq D_S \qquad (2)$$

The P-node is defined as follows (note the disjointness between $D_E$ and $D_P$):

$$D_P \doteq D_S \sqcap \neg D_E \sqcap \exists\,\textit{anatomical-part-of}.D_E \qquad (3)$$

Since $C_E$ is subsumed by $D_P$ (1) we infer that the relation ANATOMICAL-PART-OF
holds between $C_E$ and $D_E$:

$$C_E \sqsubseteq \exists\,\textit{anatomical-part-of}.D_E \qquad (4)$$
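
The way SEP triplets emulate transitive part-of reasoning through plain IS-A links can be illustrated with a small sketch. It is not the authors' Loom encoding: the class `SepHierarchy`, the `-S`/`-E`/`-P` name suffixes, and the example concepts other than Hand are assumptions made for illustration.

```python
from collections import defaultdict


class SepHierarchy:
    """Toy IS-A graph in which every concept is an S/E/P triplet."""

    def __init__(self):
        self.parents = defaultdict(set)  # node -> direct IS-A parents

    def add_triplet(self, name):
        # The E-node and the P-node are the direct subsumees of the S-node.
        self.parents[f"{name}-E"].add(f"{name}-S")
        self.parents[f"{name}-P"].add(f"{name}-S")

    def add_part_of(self, part, whole):
        # Expression (1): the part's S-node is subsumed by the whole's P-node.
        self.parents[f"{part}-S"].add(f"{whole}-P")

    def subsumed_by(self, node, ancestor):
        stack, seen = [node], set()
        while stack:
            current = stack.pop()
            if current == ancestor:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return False

    def is_part_of(self, part, whole):
        # Expression (4) holds whenever the part's E-node sits below the whole's P-node.
        return self.subsumed_by(f"{part}-E", f"{whole}-P")
```

Declaring Finger part of Hand and Hand part of Arm makes `is_part_of("Finger", "Arm")` true through subsumption alone, while `is_part_of("Hand", "Hand")` stays false because an E-node is never placed below its own P-node.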



* ALC allows for the construction of hierarchies of concepts and relations, where
$\sqsubseteq$ denotes subsumption and $\doteq$ definitional equivalence. Existential
($\exists$) and universal ($\forall$) quantification, negation ($\neg$), disjunction
($\sqcup$) and conjunction ($\sqcap$) are supported. Role filler constraints (e.g.,
typing by C) are linked to the relation name R by a dot, $\exists R.C$.




Towards Very Large Terminological Knowledge Bases: A Case Study from Medicine 






CUI1      REL  CUI2      RELA     X       Y
C0005847       C0014261  part of  MSH99   MSH99
C0005847       C0014261
C0005847                 isa      MSH99   MSH99
C0005847       C0026844  part of  MSH99   MSH99
C0005847       C0026844
C0005847       C0034052           SNMI98  SNMI98
C0005847       C0035330  isa      MSH99   MSH99
C0005847       C0042366  part of  MSH99   MSH99
C0005847       C0042367  part of  MSH99   MSH99
C0005847       C0042367           SNM2    SNM2
C0005847  CHD  C0042449  isa      MSH99   MSH99




Fig. 2. Semantic Relations in the UMLS Metathesaurus 



3 Knowledge Import and Refinement 

Our goal is to extract conceptual knowledge from two highly relevant subdomains of 
the UMLS, viz. anatomy and pathology, in order to construct a formally sound knowl- 
edge base using a terminological knowledge representation language. This task will be 
divided into four steps: (1) the automated generation of terminological expressions, (2) 
their submission to a terminological classifier for consistency checking, (3) the manual 
restitution of formal consistency in case of inconsistencies, and, finally, (4) the manual 
rectification and refinement of the formal representation structures. These four steps are 
illustrated by the workflow diagram depicted in Figure 3. 

Step 1: Automated Generation of Terminological Expressions. Sources for
concepts and relations were the UMLS semantic network and the mrrel, mrcon and mrsty
tables of the 1999 release of the UMLS metathesaurus (cf. Figure 2). The mrrel table
contains roughly 7.5 million records and exhibits the semantic links between two CUIs
(concept unique identifiers),^ the mrcon table contains the concept names and mrsty
keeps the semantic type(s) assigned to each CUI. These tables, available as ASCII files,
were imported into a Microsoft Access relational database and manipulated using SQL
embedded in the VBA programming language. For each CUI in the mrrel subset its
alphanumeric code was substituted by the English preferred term given in mrcon.
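
The substitution of CUIs by preferred terms amounts to a join between the relation table and the name table. The sketch below uses Python's `sqlite3` module rather than Access and VBA, and a deliberately simplified two-table schema; the column names and the helper `load_named_relations` are assumptions, not the actual UMLS file layout.

```python
import sqlite3


def load_named_relations(mrrel_rows, mrcon_rows):
    """Return relation rows with both CUIs replaced by their preferred terms."""
    db = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema for illustration only.
    db.execute("CREATE TABLE mrrel (cui1 TEXT, rel TEXT, cui2 TEXT, rela TEXT)")
    db.execute("CREATE TABLE mrcon (cui TEXT PRIMARY KEY, preferred_term TEXT)")
    db.executemany("INSERT INTO mrrel VALUES (?, ?, ?, ?)", mrrel_rows)
    db.executemany("INSERT INTO mrcon VALUES (?, ?)", mrcon_rows)
    rows = db.execute(
        """SELECT c1.preferred_term, r.rel, c2.preferred_term, r.rela
           FROM mrrel AS r
           JOIN mrcon AS c1 ON c1.cui = r.cui1
           JOIN mrcon AS c2 ON c2.cui = r.cui2
           ORDER BY c1.preferred_term, c2.preferred_term"""
    ).fetchall()
    db.close()
    return rows
```

Relations whose CUIs have no entry in the name table are dropped by the inner joins, which is one simple way to restrict the output to a chosen subset of concepts.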

After a manual remodeling of the 135 top-level concepts and 247 relations of the 
UMLS semantic network, we extracted - from a total of 85,899 concepts - 38,059 
anatomy and 50,087 pathology concepts from the metathesaurus. The criterion for the 
inclusion into one of these sets was the assignment to semantic types corresponding to 
the anatomy and pathology domains. Also, 2,247 concepts were included into both do- 
main sets. Since we wanted to keep the two subdomains strictly disjoint, we maintained 
these 2,247 concepts duplicated, and prefixed all concepts by ANA- or PAT- according 
to their respective subdomain. This can be justified by the observation that these hybrid 
concepts exhibit, indeed, multiple meanings. For instance, Tumor has the meaning of
a malignant disease on the one hand, and of an anatomical structure on the other hand. 

^ As a coding convention in UMLS, any two CUIs must be connected by at least a shallow
relation (in Figure 2, CHilD relations in the column REL are assumed between CUIs). These
shallow relations may be refined in the column RELA, if a thesaurus is available which contains
more specific information. Some CUIs are linked either by part-of or is-a. In any case, the
source thesaurus for the relations and the CUIs involved is specified in the columns X and Y
(e.g., MeSH 1999, SNOMED International 1998).

Fig. 3. Workflow Diagram for the Construction of a Loom Knowledge Base from the UMLS

As target structures for the anatomy domain we chose SEP triplets. These were
expressed in the terminological language Loom which we had previously extended
by a special DEFTRIPLET macro (cf. Table 1 for an example).^ Only UMLS part-of,
has-part and is-a relation attributes are considered for the construction of taxonomic
and partonomic hierarchies (cf. Figure 3). Hence, for each anatomy concept, one SEP
triplet is created. The result is a mixed IS-A and PART-WHOLE hierarchy.



(deftriplet HEART
  :is-primitive HOLLOW-VISCUS
  :has-part (:p-and
    FIBROUS-SKELETON-OF-HEART
    WALL-OF-HEART
    CAVITY-OF-HEART
    CARDIAC-CHAMBER-NOS
    LEFT-SIDE-OF-HEART
    RIGHT-SIDE-OF-HEART
    AORTIC-VALVE
    PULMONARY-VALVE))

Table 1. Generated Triplets in P-LOOM Format 
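
Generating such expressions from extracted relations is essentially text templating. The sketch below renders one expression in the layout of the HEART example; the function name `to_deftriplet` and its parameters are hypothetical, and only the keywords visible in that example are used, so it should not be read as the full P-LOOM grammar.

```python
def to_deftriplet(concept, is_a=None, has_parts=()):
    """Render one DEFTRIPLET expression as text, mirroring the HEART example."""
    lines = [f"(deftriplet {concept}"]
    if is_a:
        lines.append(f"  :is-primitive {is_a}")
    if has_parts:
        lines.append("  :has-part (:p-and")
        lines.extend(f"    {part}" for part in has_parts)
        lines[-1] += ")"  # close the (:p-and ...) list
    lines[-1] += ")"      # close the deftriplet form
    return "\n".join(lines)
```

Calling `to_deftriplet("HEART", "HOLLOW-VISCUS", ["WALL-OF-HEART", "AORTIC-VALVE"])` yields a balanced expression with one part per line.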

For the pathology domain, we treated CHD (child) and RN (narrower relation) 
from the UMLS as indicators of taxonomic links. No part-whole relations were con- 
sidered, since this category makes no sense in the pathology domain. Furthermore, for 
all anatomy concepts contained in the definitional statements of pathology concepts the 
‘S-node’ is the default concept to which they are linked, thus enabling the propagation 
of roles across the part-whole hierarchy (for details, cf. [4]). 

In both subdomains, shallow relations, such as the extremely frequent sibling SIB 
relation, were included as comments into the code to give some heuristic guidance for 
the manual refinement phase (cf. Figure 3). 

Step 2: Submission to the Loom Classifier. The import of UMLS concepts
resulted in 38,059 DEFTRIPLET expressions for anatomical concepts and 50,087
DEFCONCEPT expressions for pathological concepts. Each DEFTRIPLET was expanded
into three DEFCONCEPT (S-, E-, and P-nodes) and two DEFRELATION
(ANATOMICAL-PART-OF-X, INV-ANATOMICAL-PART-OF-X) expressions, summing up to 114,177
concepts and 76,118 relations. This yielded (together with the concepts from the
semantic network) a total of 240,764 definitory Loom expressions.
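
The counts in this step can be checked with a few lines of arithmetic. Reading the semantic network's contribution as the 135 top-level concepts plus 247 relations mentioned in Step 1 is an inference made here; the text does not itemize it, although the numbers agree with the reported total.

```python
ANATOMY_TRIPLETS = 38_059
PATHOLOGY_CONCEPTS = 50_087
SEMANTIC_NETWORK = 135 + 247  # assumed: remodeled top-level concepts and relations

anatomy_concepts = 3 * ANATOMY_TRIPLETS   # one S-, E- and P-node per triplet
anatomy_relations = 2 * ANATOMY_TRIPLETS  # ANATOMICAL-PART-OF-X and its inverse
total = (anatomy_concepts + anatomy_relations
         + PATHOLOGY_CONCEPTS + SEMANTIC_NETWORK)

print(anatomy_concepts, anatomy_relations, total)  # 114177 76118 240764
```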

^ The UMLS anatomy concepts are mapped to an intermediate format, the reason being that the
manual refinement of automatically generated LOOM triplets is time-consuming and
error-prone due to their complex internal structure. Hence, we specified an intermediate
representation language, P-LOOM, which allows the emerging knowledge structures to be
manipulated prior to converting them to LOOM. P-LOOM provides the full expressiveness of
LOOM, enriched by special constructors for the encoding of the part-whole relations. The
main characteristics of P-LOOM are the macro DEFTRIPLET (which corresponds to LOOM’s
concept-forming constructor DEFCONCEPT) and the keywords :PART-OF and :HAS-PART.

From 38,059 anatomy triplets, 1,219 DEFTRIPLET statements exhibited a :HAS-PART
clause followed by a list of a variable number of triplets, containing more than
one argument in 823 cases (average cardinality: 3.3). 4,043 DEFTRIPLET statements
contained a :PART-OF clause, only in 332 cases followed by more than one argument
(average cardinality: 1.1). The obtained knowledge base was then submitted to the
terminological classifier and checked for terminological cycles and coherence. In the
anatomy subdomain, one terminological cycle and 2,328 incoherent concepts were
identified; in the pathology subdomain, 355 terminological cycles but not a single
incoherent concept were found (cf. Table 2).

Step 3: Manual Restitution of Consistency. The inconsistencies of the anatomy 
part of the knowledge base identified by the classifier could all be traced back to the 
simultaneous linkage of two triplets by both is-a and part-of links, an encoding that
raises a conflict due to the disjointness required for corresponding P- and E-nodes.
In most of these cases the affected parents belong to a class of concepts that obviously
cannot be appropriately modeled as SEP triplets, e.g., Subdivision-Of-Ascending-Aorta,
Organ-Part. The meaning of these concepts almost paraphrases that of a
P-node, so that in these cases the violation of the SEP-internal disjointness condition
could be accounted for by substituting the involved triplets with simple LOOM concepts,
by matching them with already existing P-nodes or by disabling IS-A or PART-OF links.





                          Anatomy    Pathology

Triplets                   38,059            -
DEFCONCEPT statements     114,177       50,087
cycles                          1          355
inconsistencies             2,328            0



Table 2. Classification Results for the Concept Import 



In the pathology part of the knowledge base, we expected a large number of termi- 
nological cycles to arise as a consequence of interpreting the thesaurus-style narrower 
term and child relations through taxonomic subsumption (IS-A). Bearing in mind the 
size of the knowledge base, we consider 355 cycles a tolerable number. Those cycles
were primarily due to very similar concepts, e.g., Arteriosclerosis vs. Atherosclerosis
and Amaurosis vs. Blindness, and to residual categories (“other”, “NOS” = not
otherwise specified). These were directly inherited from the source terminologies and
are notoriously difficult to interpret out of their definitional context, e.g.,
Other-Malignant-Neoplasm-of-Skin vs. Malignant-Neoplasm-of-Skin-NOS. The cycles
were analyzed and a negative list which consisted of 630 concept pairs was manually 
derived. In a subsequent extraction cycle we incorporated this list in the automated con- 
struction of the LOOM concept definitions, and given these new constraints, a fully 
consistent knowledge base was generated. 
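
The interplay of cycle detection and the negative list can be illustrated with a minimal sketch. It is not the LOOM classifier: it assumes that IS-A links are given as a mapping from each concept to its set of parents, and that the negative list holds ordered (child, parent) pairs to be suppressed. The parent Vascular-Disease is a hypothetical concept added only for the example.

```python
def find_cycle(is_a):
    """Return one IS-A cycle as a list of concepts, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for parent in sorted(is_a.get(node, ())):
            state = color.get(parent, WHITE)
            if state == GREY:
                # parent is still on the current path: a cycle is closed
                return stack[stack.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(is_a):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def apply_negative_list(is_a, negative_pairs):
    """Drop every IS-A link (child, parent) that appears on the negative list."""
    return {
        child: {p for p in parents if (child, p) not in negative_pairs}
        for child, parents in is_a.items()
    }


links = {
    "Arteriosclerosis": {"Atherosclerosis", "Vascular-Disease"},
    "Atherosclerosis": {"Arteriosclerosis"},
}
print(find_cycle(links))
cleaned = apply_negative_list(links, {("Atherosclerosis", "Arteriosclerosis")})
print(find_cycle(cleaned))
```

The recursive search is adequate for small samples; for a knowledge base of the size reported here an iterative traversal would be the safer choice.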

Step 4: Manual Rectification and Refinement of the Knowledge Base. This step
- when performed for the whole knowledge base - is time-consuming and requires 
broad and in-depth medical expertise. We have extracted two random samples (n=100 
each) from both the anatomy and pathology part of the knowledge base; the samples 
were then analyzed by a medical student and a physician. 

From the experience we gained in the anatomy and pathology subdomains so far,
the following workflow steps can be derived: 
- Checking the correctness and completeness of both the taxonomic and partitive
hierarchies. Taxonomic and partitive links are manually added or removed. Primitive
subsumption is replaced by nonprimitive subsumption whenever possible. This is a
crucial point, because the automatically generated hierarchies contain only
information about the parent concepts and necessary conditions. As an example, the
automatically generated definition of Dermatitis includes the information that
it is an Inflammation and that the role HAS-LOCATION must be filled by the
concept Skin. An Inflammation that HAS-LOCATION Skin, however, cannot
automatically be classified as Dermatitis, because these conditions are necessary
but not sufficient.

Results: In the anatomy sample, only 76 concepts could be unequivocally classified
as belonging to ‘canonical’ anatomy. (The remainder, concepts such as
ANA-Phalanx-of-Supernumerary-Digit-of-Hand, referring to pathological anatomy,
were immediately excluded from analysis.) Besides the assignment to the
semantic types, only 27 (direct) taxonomic links were found. Another 83 UMLS
relations (mostly “child” or “narrower” relations) were manually upgraded to taxonomic
links. 12 (direct) part-of and 19 has-part relations were found. Four part-of
relations and one has-part relation had to be removed, since we considered them
implausible. 51 UMLS relations (mostly “child” or “narrower” relations) were
manually upgraded to part-of relations, and 94 UMLS relations (mostly “parent”
or “broader” relations) were upgraded to has-part relations. After this workup and
upgrade of shallow UMLS relations to semantically more specific relations, the
sample was checked for completeness again. As a result, 14 is-a and 37 part-of
relations were still considered missing.

In the pathology sample, the assignment to the pathology subdomain was consid- 
ered plausible for 99 of 100 concepts. A total of 15 false is-a relations was identified 
in 12 concept definitions. 24 is-a relations were found to be missing. 

- Check of the :has-part arguments assuming ‘real anatomy’. In the UMLS sources
part-of and has-part relations are considered as symmetric. According to our
transformation rules, the attachment of a role HAS-ANATOMICAL-PART to an E-node
B_E, with its range restricted to A_E, implies the existence of a concept A for the
definition of a concept B. On the other hand, the classification of A_E as being
subsumed by the P-node B_P, the latter being defined via the role
ANATOMICAL-PART-OF restricted to B_E, implies the existence of B_E given the
existence of A_E. These constraints do not always conform to ‘real’ anatomy, i.e.,
anatomical concepts that may exhibit pathological modifications. Figure 4 (left part)
sketches a concept A that is necessarily ANATOMICAL-PART-OF a concept B, but whose
existence is not required for the definition of B. This is typical of the results of
surgical interventions, e.g., a large intestine without an appendix.

Results: All 112 has-part relations obtained by the automated import and the manual
workup of our sample were checked. The analysis revealed that more than half
of them (62) should be eliminated in order not to obviate a coherent classification
of pathologically modified anatomical objects.^ The opposite situation is also
possible (cf. Figure 4, right part): the definition of A_E does not imply that the role
ANATOMICAL-PART-OF be filled by B_E, but B_E does imply that the inverse role
be filled by A_E. As an example, a Lymph-Node necessarily contains
Lymph-Follicles, but there exist Lymph-Follicles that are not part of a
Lymph-Node. In our sample, this situation has only been encountered once.

^ In Table 1 the concepts marked by italics, viz. Aortic-Valve and Pulmonary-Valve,
should be eliminated from the :HAS-PART list, because they may be missing in certain cases as
a result of congenital malformations, inflammatory processes or surgical interventions.

Fig. 4. Patterns for Part-whole Reasoning Using SEP triplets

- Analysis of the sibling relations and defining concepts as being disjoint. In UMLS,
SIB relates concepts that share the same parent in a taxonomic or partonomic
hierarchy. Pairs of sibling concepts may either have common descendants or not.
If not, they constitute the roots of two disjoint subtrees. In a taxonomic hierarchy,
this means that one concept implies the negation of the other (e.g., a benign
tumor cannot be a malignant one, and vice versa). In a partitive hierarchy, this can
be interpreted as spatial disjointness, viz. one concept does not spatially overlap
with another one. As an example, Esophagus and Duodenum are spatially disjoint,
whereas Stomach and Duodenum are not (they share a common transition
structure, called Pylorus), as is the case for all neighboring structures that have a
surface or region in common. Spatial disjointness can be modeled such that the
definition of the S-node of the concept A implies the negation of the S-node of the
concept B.

Results: We found on average 6.8 siblings per concept in the anatomy subdomain and
8.8 in the pathology subdomain. So far, the analysis of sibling relations has been
performed only for the anatomy domain. From a total of 521 sibling relations, 9 were
identified as is-a, 14 as part-of and 17 as has-part; 404 were found to hold between
spatially disconnected concepts.

- Completion and modification of anatomy-pathology relations. For each pathology 
concept (as determined by the LOOM system after complete classification) it
has to be checked whether the anatomy-pathology links are correct and complete. 
Incorrect constraints have to be removed from the concept definition itself or from 
the one of the subsuming concepts. For each correct anatomy-pathology relation 
the decision must be made whether the E-node or the S-node has to be addressed as 
the target concept for modification. In the first case, the propagation of roles across 
part-whole hierarchies is disabled, in the second case it is enabled (cf. [4] for a 
more comprehensive discussion of these phenomena). 
Results: In our random sample we found 522 anatomy-pathology relations, of
which 358 were considered incorrect by the domain experts. In 36 cases an adequate
anatomy-pathology relation was missing. All 164 HAS-LOCATION roles were
analyzed as to whether they were to be filled by an S-node or an E-node of an
anatomical triplet. In 153 cases, the S-node (which allows propagation across the
part-whole hierarchy) was considered to be adequate; in 11 cases the E-node was
preferred. The analysis of the random sample of 100 pathology concepts revealed
that only 17 were to be linked with an anatomy concept. In 15 cases, the default
linkage to the S-node was considered to be correct, in one case the linkage to the
E-node was preferred, and in another case the linkage was considered to be false.

The high number of implausible constraints points to the undefined semantics of
HAS-LOCATION links in the UMLS sources. While we interpreted them in terms
of a conjunction for the import routine, a disjunctive meaning seems to prevail
implicitly in many definitions of top-level concepts such as Tuberculosis. In this
example, we find all anatomical concepts that can be affected by this disease linked
by HAS-LOCATION. All these constraints (e.g., HAS-LOCATION Urinary-Tract)
are inherited by subconcepts such as Tuberculosis-of-Bronchus. Hence, a
thorough analysis of the top-level pathology concepts is necessary, and conjunctions
of constraints will have to be replaced by disjunctions where necessary.
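
The difference between the two readings of inherited HAS-LOCATION constraints can be made concrete with a small sketch. It is a deliberate simplification rather than description-logic classification: each concept carries a set of location fillers, constraints are inherited along a single IS-A parent, and a case is checked against them. The two-element location set for Tuberculosis is an abridged, hypothetical stand-in for the full UMLS list.

```python
HAS_LOCATION = {
    "Tuberculosis": {"Bronchus", "Urinary-Tract"},   # abridged, hypothetical
    "Tuberculosis-of-Bronchus": {"Bronchus"},
}
IS_A = {"Tuberculosis-of-Bronchus": "Tuberculosis"}


def inherited_constraints(concept):
    """Collect the location sets of a concept and of all its ancestors."""
    sets = []
    while concept is not None:
        sets.append(HAS_LOCATION.get(concept, set()))
        concept = IS_A.get(concept)
    return sets


def satisfies(concept, observed, reading):
    """Check observed locations against every inherited constraint set.

    'conjunctive': all listed locations must be present.
    'disjunctive': at least one listed location must be present.
    """
    for locations in inherited_constraints(concept):
        if not locations:
            continue
        if reading == "conjunctive":
            ok = locations <= observed
        else:
            ok = bool(locations & observed)
        if not ok:
            return False
    return True


case = {"Bronchus"}
print(satisfies("Tuberculosis-of-Bronchus", case, "conjunctive"))  # False
print(satisfies("Tuberculosis-of-Bronchus", case, "disjunctive"))  # True
```

Under the conjunctive reading the inherited Urinary-Tract constraint rules out a purely bronchial case, which is the kind of implausible constraint reported above.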



4 Conclusions 

Instead of developing sophisticated medical knowledge bases from scratch, we here 
propose a conservative approach — reuse existing large-scale resources, but refine the 
data from these resources so that advanced representational requirements imposed by 
more expressive knowledge representation languages are met. 

The ontology engineering methodology we have proposed in this paper does ex- 
actly this. It provides a formally solid description logics framework with a modeling 
extension by SEP triplets so that both taxonomic and partonomic reasoning are sup- 
ported equally well. While plain automatic conversion from semi-formal to formal en- 
vironments causes problems of adequacy of the emerging representation structures, the 
refinement methodology we propose already inherits its power from the terminologi- 
cal reasoning framework. In our concrete work, we found the implications of using the 
terminological classifier, the inference engine which computes subsumption relations, 
of utmost importance and of outstanding heuristic value. Hence, the knowledge refine- 
ment cycles are truly semi-automatic, fed by medical expertise on the side of the human 
knowledge engineer, but also driven by the reasoning system which makes explicit the 
consequences of (im)proper concept definitions. 

We also stipulate that our knowledge engineering methodology is general in the 
sense that it can be applied to all those domains where, at least, shallow thesauri are 
available. Our focus on partonomic knowledge and reasoning is, of course, best matched 
by science and engineering domains. 
Abstract. This paper presents an approach for using ontologies to both
organize knowledge and facilitate knowledge sharing among service and
user interface agents in a multi-agent personal communication system.
These systems are designed to manage a user’s diverse and heterogeneous
communication services. The nature of this domain and the user requirements
of mobile, highly personalized services make a multi-agent solution
particularly attractive. Multi-agent personal communication systems
have “service adapter” agents that communicate with “personal
communication” interface agents that represent the user’s communication
preferences. The paper describes the use of ontologies for the partitioning
of knowledge to facilitate the programming of multi-agent personal
communication systems and the use of a client/server model for
knowledge sharing among the service adapter and personal communication
agents. Meta-knowledge is used for specifying to the personal communication
agents the structure of the knowledge they are to receive
from the service adapter agents.



1 Introduction 

There are now many ways to receive personal messages (such as fax, telephone, 
pagers, email, etc.) and a corresponding multitude of ways of responding to 
them. However, such communication systems are rarely integrated, making it 
difficult for messages to pass between systems to reach the recipient. For example,
a person usually has several voice-mail accounts: at home, at the office and
on a cellphone. They may also have a pager and one or more email accounts.
Each system may have a different user interface, and messages cannot be easily
exchanged between them. The desire to integrate these multiple communications
technologies into a single cohesive system has led to the development of
systems that integrate multiple communications modalities to provide a single 
seamless service for the user. These systems are intended to provide a means
of configuring a user’s communication services in a way which is personal to the 
recipient, and in which the separate networks that might be used to deliver a 
message are invisible to the user. 
The distributed and heterogeneous nature of networked communications systems,
together with the personal preferences of the user, make a multi-agent architecture
an attractive solution for providing an intelligent personal communication
environment. Several projects have investigated the use of personal
agents that can represent the user’s communication needs and help in the
manipulation of this information for a mobile user. For example, the DUET
project investigated the use of personal agents in a standard telecommunication
network. User agents were implemented for the management of mobility in the
TINA-C project [3]. Site Profile Agents were used for managing aspects of personal
mobility in the NPAS system. To this end, we have developed a Personal
Agent Mobility Management (PAMM) system that employs personal communications
agents (PCAs), service adapter agents and service-specific gateways
to accomplish interception, filtering and delivery of multi-modal messages.

All these systems have in common the desire to customize and configure ex- 
isting communication services or offer new services using the agent paradigm. 
Unfortunately, most of the reported information on these systems focuses on
infrastructure for multi-agent systems rather than discussing how the agents share
knowledge among themselves. The model used for sharing knowledge among the
agents is crucial in determining the ease with which the system can manage the
addition and removal of services. We believe that the use of ontologies as a
mechanism for organizing and sharing knowledge can be a step towards facilitating
service deployment in multi-agent communication systems, but communication
services are so diverse that finding a common ontology vocabulary will most
likely be a daunting task.

This paper describes an approach for knowledge sharing in a personal com- 
munications multi-agent system based on the use of ontologies (for defining the 
context of an agent’s knowledge) and a client/server paradigm for the distribution
of knowledge among the agents. The paper also describes the use of meta-knowledge
to specify to the Personal Communication Agents (PCAs) a generic
structure for the events that service adapter agents will present to the PCA. This
meta-knowledge allows the PCAs to deal with the diversity of the communication 
services. 

2 Ontologies for Agent Knowledge Sharing 

Before agents can begin the process of knowledge sharing, there must be some 
mechanism for classification of knowledge. It is possible to consider the use of 
ontology for the partitioning of knowledge into specific domains. In philosophy, 
ontology is the study of the kinds of things that exist. In artificial intelligence 
(AI), the term ontology has come to mean one of two things:

— A representation vocabulary typically specialized to some domain or subject 
matter. 

— A body of knowledge describing some domain, typically a common sense 
knowledge domain, using such a representation vocabulary. 
The first definition for ontology is more commonly found in the AI literature
and focuses on the representation of knowledge for re-usability. It investigates
the use of terms and relationships in a specific domain and generally requires
careful analysis of the terms, kinds of objects, and relations that exist in the domain.
This study of ontology has been used to investigate ways for specifying
content-specific agreements for the sharing and reuse of knowledge among software
entities [7], standard knowledge representation [8] and, recently, as conceptual
models for XML documents. These ontologies have been driven by the
desire to develop conventions that facilitate the sharing and re-use of knowledge 
bases and knowledge based systems. Much of the work into ontologies therefore 
is concerned with facilitating consensus of sharable knowledge bases, defining 
common constructs within families of representation languages, and translation 
between different representation languages. 

The second definition applies more to the manner in which ontologies are used 
for this multi-agent system. For example, Lenat et al. refer to ontology as
knowledge of some area. It is this definition of ontology that we use for knowledge 
partitioning in our multi-agent systems. Ontologies are used to represent a body 
of knowledge pertaining to a particular domain, i.e. a communication service or 
knowledge for a specific function. 

In order to share the knowledge among the agents in the system, agent brokers 
must be established. Agent brokers act as a knowledge repository and allow 
agents a means of querying for knowledge. Ontologies can be used to organize 
the knowledge for the agent broker, and an agent communication language like
KQML [11, 14] or the FIPA ACL can be used as a language by which the
agents communicate knowledge among themselves. These are mechanisms for
sharing knowledge among agents, but it is necessary to define an approach for
deciding how knowledge is partitioned for an agent such that some of it is for
internal use while other parts are exposed to the broker. We believe that ideas
from the commonly used client/server model can be used in this circumstance.

The client/server model is a common approach for distributed access to in- 
formation. This is where information is maintained in databases on the server 
side of the system, and clients establish a connection to a server and request 
information by either invoking the server’s methods or using an agreed proto- 
col of communication. Agent systems can be more effective than client/server 
models as they support peer-to-peer communication and can therefore engage 
in two-way requests for information. It is still possible to use a client/server 
model for designing the behaviors and interactions among the agents and not use 
client/server protocols for communication. For example, our agents use KQML 
for peer-to-peer agent communication rather than typical HTTP client/server
protocols that do not support server-initiated messages. This allows our agent
system to conform to the requirement that agents need to support peer-to-peer 
communications so as not to degrade the collective task performance of the 
multi-agent system. 

The client/server model is a powerful social model that naturally fits conversation
styles. Agents, while using peer-to-peer communications, behave similarly
to client/server objects. In most situations, agents are designed with a particular
functionality in mind. This function is its role in a multi-agent system and is 
what is exposed to the broker as a service. Therefore these agents behave as a 
server of a particular service to the other agents. The knowledge shared with the 
other agent is client knowledge while the knowledge maintained by the agent 
itself is server knowledge. Personal communication agents are natural clients to 
the services of a service adapter agent. We propose an approach to define this 
separation of knowledge and present an example of how a personal communica- 
tion agent uses this knowledge to program a set of service adapter agents. 

3 PAMM Agents and Ontologies 

In order to achieve a flexible, scalable and adaptable system to manage per- 
sonal communications, a seamless messaging system based on the use of soft- 
ware agents was developed. It utilizes personal communications agents, service 
adapter agents and service-specific gateways to accomplish interception, filtering 
and delivery of multi-modal messages. The agents use the Java Expert System
Shell (JESS) for internal reasoning and knowledge handling and KQML for
inter-agent communication.

Each PAMM agent begins with an identical set of Java class files that define
the skeleton or body of the agent. This skeleton provides the backbone for the
knowledge engine and the CORBA Event Service level for agent communication.
The code for a generic agent looks like the following:

import java.io.FileInputStream;
import jess.*;

NullDisplay nd = new NullDisplay();

// Create a Jess engine
Rete rete = new Rete(nd);

// Open the JESS knowledge file
FileInputStream fis = new FileInputStream("knowledge.clp");

// Create a parser for the file and specify the Rete engine
Jesp j = new Jesp(fis, rete);
try {
    // Parse and execute the code
    j.parse(false);
} catch (ReteException re) {
    // Catch and print all Jess errors
    re.printStackTrace(nd.stderr());
}

What differentiates one agent from the next is its set of ontology files. Each
ontology file consists of a set of facts, rules and helper functions defined in JESS that
instruct the agent on how to perform tasks, communicate with another agent or
interact with an external service.

Ontologies are used to structure the agent knowledge into domain specific 
knowledge that each agent can use. Some of this knowledge can be public to all 
agents while some can be specific to the agent. For example, a set of JESS rules
and functions was devised to send and receive KQML messages. These rules are
collected into a single ontology file and loaded into the knowledge base of each 
agent at the time of instantiation. 

Agent-specific knowledge is the type of knowledge that determines
the behaviour of the agent. For the PAMM system it is the JESS rule files that
define ontologies for particular services. For example, a service adapter agent
uses JESS knowledge that is specific to the services it is managing. Figure 1
shows an example of the ontology JESS files that get loaded in a typical PCA
and adapter agent.




Fig. 1. Typical JESS knowledge files in PAMM. 



The other standard ontology JESS files that are available are the user in- 
terface ontology to manage presentation of rulesets to the user, the content 
converter ontology to manage content translation services, and the ruleset on- 
tology to manage the creation of rules for the PCA from user preferences. When 
these rule files are loaded into the agent’s knowledge base, a fact is asserted that 
the knowledge for a particular ontology has been loaded. This allows the agent 
to be aware of and to easily enumerate the ontologies it supports. For example, 
when the KQML rules are loaded, the first fact asserted is the following: 

(assert (ontology-supported kqml)). 
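
The bookkeeping described here can be mimicked outside JESS with a small sketch. It is not the PAMM implementation: the `KnowledgeBase` class and its method names are hypothetical, and facts are modeled as plain tuples rather than JESS facts.

```python
class KnowledgeBase:
    """Toy fact store that records which ontologies have been loaded."""

    def __init__(self):
        self.facts = set()

    def assert_fact(self, *fact):
        self.facts.add(tuple(fact))

    def load_ontology(self, name, facts=()):
        # The first fact asserted marks the ontology as supported,
        # mirroring (assert (ontology-supported kqml)).
        self.assert_fact("ontology-supported", name)
        for fact in facts:
            self.assert_fact(*fact)

    def supported_ontologies(self):
        return sorted(f[1] for f in self.facts if f[0] == "ontology-supported")


kb = KnowledgeBase()
kb.load_ontology("kqml")
kb.load_ontology("pager", facts=[("service-ontology", "pager")])
print(kb.supported_ontologies())
```

Because the marker fact lives in the same store as all other facts, enumerating the supported ontologies is an ordinary query rather than a separate registry lookup.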



4 Client/Server Knowledge Sharing Model 

Beyond the use of ontologies for organizing the agent’s knowledge, we require a 
model for what particular knowledge will be private to the agent and which will 
be exposed to the other agents. This is similar to the concept of public interfaces 






Ramiro Liscano, Katherine Baker, and John Meech 



in object-oriented programming except that we are exposing a set of JESS rules 
and facts rather than methods. The model we use divides the knowledge into 
client and server knowledge. Client knowledge is that which is stored in the 
service ontology database while the server ontology is loaded into an agent as 
its own private knowledge base. An agent behaves as a client when it initiates 
requests from another agent. The other agent then behaves as a server to that 
request. 

If an agent is designed as a client to another agent, then the
server agent does not need to load the client agent’s knowledge, since it will
never initiate communication with the client. This is the typical model between
the PCA and service adapter agents. The PCA is always a client to the service
agents in the PAMM multi-agent system. This is because the PCA initiates
requests to the service adapter agents to manage an event related to the
communications gateway they are servicing. The service adapter agents do not
initiate communications with the PCA. Any exceptions or triggers to events do
not require any knowledge from the client to be passed to it.

In reality, the agent’s knowledge cannot be divided into two independent 
sets of client and server knowledge. There is always knowledge that is required 
by both sets. Therefore an agent’s knowledge is divided into three parts: client 
knowledge, server knowledge, and shared functions. Figure 2 shows an example
of what components of the knowledge get loaded by the PCA and service adapter 
agents. 




Fig. 2. Knowledge partitioning between the PCA and adapter agents. 

An example of a service adapter agent is a pager service adapter, which is
relatively simple and allows a client agent to send a “page” to a particular
client. The code below shows the JESS knowledge that defines the pager service
ontology for the adapter agent.

; PAGER ONTOLOGY PACKAGE (SHARED)
(assert (ontology-supported pager))
(assert (service-ontology pager))

; Basic information structures
(deftemplate send-page (slot pagerID) (slot message))
This part of the JESS code is the common knowledge that is both private
and public. This component of the knowledge defines common structure elements
and the ontology supported. In the case of the pager service, there is a
fact that asserts that this “pager” ontology is also a service. The following code
is the client part of the knowledge.

; PAGER ONTOLOGY PACKAGE (CLIENT)

; Trigger FACT:
;   (send-page (pagerID ?id) (message ?msg))
; Propagates send-page to a pager adapter.
(defrule send-pager-msg
  (declare (salience -40))
  ?factID <- (send-page (pagerID ?id)
                        (message ?msg))
  (service-agent (name ?name)
                 (serviceSupported pager))
  =>
  (KQMLTellFact ?name pager ?factID)
  (assert (retract-fact ?factID)))

; UI facts required
(assert (action-info (ontology pager)
  (name send-page) (type unordered)
  (slots (create$ pagerID message))
  (helpInfo "Sends a pager message")))

The “client” component of the knowledge defines rules that eventually result
in either a query to the agent or an assertion of a fact into the agent’s
knowledge. For example, the pager client knowledge defines a rule send-pager-msg that
matches on a send-page fact. The agent also requires knowledge that the service
agent “pager” exists, using the fact service-agent.

In this particular client knowledge there is also a user interface (UI) fact
named action-info. This is the manner in which each service agent declares the
“actions” that it responds to. It is a form of meta-knowledge that is defined
in the user interface ontology knowledge.

The following code is the agent knowledge for the pager adapter agent. 

; PAGER ONTOLOGY PACKAGE (SERVER)

; Load support functions
(load-function "Agent.Engine.SendPageFn")

; Knowledge-set for sending a page
(defrule SendPagerMsg
  (declare (salience -40))
  ?fact <- (send-page (pagerID ?id)
                      (message ?msg))
  =>
  (SendPageFn ?id ?msg)
  (assert (retract-fact ?fact)))



This agent knowledge is private to the agent. It loads all the required func- 
tions to communicate with external resources and rules that trigger on particular 
facts that in turn result in an interaction with the resource gateways. These rules 
can be similar to those defined by the client knowledge but may also contain 
private information that the developer of the agent does not wish to share with 
other agents in the system. 
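
The division of labour between client and server knowledge in the pager example can be sketched as a small simulation. This is an illustration under stated assumptions, not the PAMM code: the `Agent` class, the rule factories, the direct method call that stands in for `KQMLTellFact`, and the list that stands in for the pager gateway behind `SendPageFn` are all hypothetical.

```python
class Agent:
    """Toy agent with a working memory of facts and a list of rules."""

    def __init__(self, name):
        self.name = name
        self.facts = []   # (template, slots) pairs
        self.rules = []   # callables: rule(agent, fact) -> True if fact consumed

    def assert_fact(self, template, **slots):
        self.facts.append((template, slots))

    def run(self):
        for fact in list(self.facts):
            for rule in self.rules:
                if rule(self, fact):
                    self.facts.remove(fact)   # plays the role of retract-fact
                    break


def make_client_rule(directory):
    """Client knowledge: forward send-page to an agent that supports 'pager'."""
    def send_pager_msg(agent, fact):
        template, slots = fact
        if template != "send-page":
            return False
        for other_template, other in agent.facts:
            if other_template == "service-agent" and other.get("serviceSupported") == "pager":
                # Stand-in for KQMLTellFact: deliver the fact to the adapter.
                directory[other["name"]].assert_fact(template, **slots)
                return True
        return False
    return send_pager_msg


def make_server_rule(gateway_log):
    """Server knowledge: hand the page to the (simulated) pager gateway."""
    def send_pager_msg(agent, fact):
        template, slots = fact
        if template != "send-page":
            return False
        gateway_log.append((slots["pagerID"], slots["message"]))
        return True
    return send_pager_msg


sent = []
adapter = Agent("pager-adapter")
adapter.rules.append(make_server_rule(sent))
pca = Agent("pca")
pca.rules.append(make_client_rule({"pager-adapter": adapter}))

pca.assert_fact("service-agent", name="pager-adapter", serviceSupported="pager")
pca.assert_fact("send-page", pagerID="555-0100", message="Call the office")
pca.run()
adapter.run()
print(sent)
```

Only the client rule needs to know that a pager service agent exists; the gateway call stays inside the server rule, which matches the private/public split described above.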

The knowledge developed for any service agent in the PAMM system primarily
addresses the knowledge required either to query a service agent or to
request some particular action from it. In other words, as is typical of client/server
models, this knowledge allows the client to instigate the transfer of knowledge.
This is not sufficient to allow an agent to register with a service agent
to receive notifications of particular service events. For this, meta-knowledge is
used that defines the events that a client agent can register for with a service agent.
This meta-knowledge consists of structures, i.e. deftemplates in JESS, that belong to the
user interface ontology.



5 The User Interface Ontology 

The user interface ontology is a form of meta-knowledge for interface agents that 
defines five structures for encapsulating knowledge that the service agents make 
available to any client agents that wish to register with them. They are part 
of the user interface ontology because the PCA takes the knowledge in these 
structures and presents it to the user as an interface. The meta-knowledge is 
required because the events differ across all the services but they represent the 
same type of information. 

The two structures described in this paper are the fact-info and action-info 
structures. The fact-info defines the structure of a fact that will be sent to the 
PCA when it registers for an event of the same name as that of the fact. In a 
similar manner, the action-info defines the name and structure of a fact that can 
be sent to a service adapter agent. The following are JESS definitions for the 
fact-info and action-info structures: 

(deftemplate fact-info (slot ontology)
    (slot name) (multislot slots)
    (slot helpInfo) (slot volatile))

(deftemplate action-info (slot ontology)
    (slot name) (slot type) (multislot slots)
    (slot helpInfo))

Each service event fact will contain the following information:

- ontology: To what ontology the fact belongs.

- name: Name of the unordered fact.

- slots: Name of each slot the fact contains.

- helpInfo: An explanation of what the fact represents.

- volatile: Boolean indicating whether to delete the fact after it is processed.

In a similar manner, the service action uses the same slots, except that the
following carry different information:

- type: Whether the fact is ordered or unordered.

- slots: Name of each slot for an unordered or ordered fact.

This meta-knowledge is used in the following manner. A service agent defines
in its Service.Client JESS knowledge file the facts that a client agent can register
for and the actions that it will respond to. The following fact-info is an example
from the client knowledge of a telephony service adapter agent; it defines a fact
called "call-info" with several unordered slots named "id", "address", etc.:

(assert (fact-info (ontology telephony)
    (name call-info) (slots (create$ id
        address line state date time bearerMode
        rate mediaMode callerID calledID
        redirectingID origin reason userInfo
        transferTo ringCount))
    (helpInfo "Data on an incoming call")))

The following code is an example of an action fact that the telephony service 
adapter agent defines. 

(assert (action-info (ontology telephony)
    (name make-call) (type ordered)
    (slots (create$ address number param1))
    (helpInfo "Places a call")))

The PCA, being a user interface agent, presents to the user an
interface of service events and actions that can be configured and activated; see
Figs. 3, 4, and 5.

The user's choices generate new rules in the PCA, each of which is a combination
of a customized service event and an action of another service. This is now
"new" knowledge for the PCA. This new knowledge conforms to the structure
of meta-knowledge defined as rule-sets that belong to the "ruleset" ontology.

Fig. 3. User interface for service events.

6 The Ruleset Ontology 

The ruleset ontology is a form of meta-knowledge for service personalization that 
defines knowledge related to the management of the knowledge that gets created 
when a user personalizes their communication services via the PCA interface. It 
defines a meta-knowledge structure called rule-set in the following manner: 

(deftemplate rule-set (slot name)
    (multislot events) (multislot eventFacts)
    (slot register) (slot unregister))

A rule-set fact contains the following information:

— name: Name of the new rule 

— events: The service event name and attributes. 

— eventFacts: The name of the rule-set fact. 

— register: Rule to register with in the PCA agent. 

— unregister: Rule to un-register in the PCA agent. 

Fig. 4. User interface for service actions.

The following code is an example of a rule-set created by a user who wishes
to be paged for any e-mail message sent to their account.

(rule-set (name EmailNotifyPager)
    (events "(internet-msg (user ?user)
        (MAILserver ?MAILserver) (id ?id)
        (from ?from) (to ?to)
        (subject ?subject)) ; internet-mail")
    (eventFacts internet-msg)
    (register "(defrule EmailNotifyPager
        (internet-msg (user ?user)
            (MAILserver ?MAILserver) (id ?id)
            (from ?from) (to ?to)
            (subject ?subject))
        =>
        (assert (send-page (pagerID \"7608904\")
            (message ?subject)))) ; pager")
    (unregister "(undefrule EmailNotifyPager)"))



This rule-set states that the event to trigger on is "internet-msg" with no
particular attributes, i.e. from anyone, any subject. The event fact returned
will be "internet-msg". The rule that is registered in the PCA when activated
is defined in the register slot and is the "EmailNotifyPager" rule literally as
shown in quotations in that slot. Similarly, to unregister the rule the "undefrule"
command needs to be issued as shown in the unregister slot.

Fig. 5. Configuration information for a service event.

When a user chooses to activate this rule-set, the PCA determines which
agent can service the "internet-msg" event and, using the KQML subscribe
performative, creates an "ask-if" request for the event and passes this in the
content field of the KQML subscribe message. The service agent must then create
a rule such that, when a match occurs to the particular "internet-msg" event, a "tell"
message is returned to the PCA agent with a JESS fact corresponding to the
"internet-msg" rule-info structure. This will match the antecedent part of the
EmailNotifyPager rule in the PCA, which will result in the "send-page" request
being issued to the pager adapter agent.
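
The activation flow just described can be mimicked with a toy model. The following Python sketch is not JESS or KQML; the classes `ServiceAgent` and `PCA` and their methods are hypothetical names introduced only for illustration. Subscribing stands in for the KQML subscribe message, and the callback stands in for the "tell" message that carries the event fact back to the PCA.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Fact = Dict[str, str]


@dataclass
class ServiceAgent:
    """Toy stand-in for a service adapter agent (hypothetical class)."""
    event_name: str
    subscribers: List[Callable[[str, Fact], None]] = field(default_factory=list)

    def subscribe(self, callback: Callable[[str, Fact], None]) -> None:
        # Plays the role of the KQML subscribe performative.
        self.subscribers.append(callback)

    def raise_event(self, fact: Fact) -> None:
        # Plays the role of the "tell" message carrying the event fact.
        for callback in self.subscribers:
            callback(self.event_name, fact)


@dataclass
class PCA:
    """Toy stand-in for the PCA: holds activated rules and issued requests."""
    rules: Dict[str, Callable[[Fact], Fact]] = field(default_factory=dict)
    outbox: List[Fact] = field(default_factory=list)

    def activate(self, event_name: str, rule: Callable[[Fact], Fact],
                 agent: ServiceAgent) -> None:
        # Register the rule locally, then subscribe to the service event.
        self.rules[event_name] = rule
        agent.subscribe(self.on_tell)

    def on_tell(self, event_name: str, fact: Fact) -> None:
        # A matching event fact fires the rule, which yields an action request.
        rule = self.rules.get(event_name)
        if rule is not None:
            self.outbox.append(rule(fact))


def email_notify_pager(fact: Fact) -> Fact:
    # Mirrors the consequent of the EmailNotifyPager rule shown above.
    return {"action": "send-page", "pagerID": "7608904",
            "message": fact["subject"]}


mail_agent = ServiceAgent("internet-msg")
pca = PCA()
pca.activate("internet-msg", email_notify_pager, mail_agent)
mail_agent.raise_event({"from": "alice", "to": "bob", "subject": "Lunch?"})
print(pca.outbox)
```

In this model the PCA never polls the mail agent; the request to the pager is produced only when the subscribed event arrives.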

In this manner the PAMM multi-agent system can easily accommodate the 
addition of new services. The user interface is dynamically created when new 
services are introduced to the system. 

7 Summary

This paper described in detail how knowledge is shared among agents in a multi- 
agent personal communications system. The system takes advantage of ontolo- 
gies for partitioning knowledge into domain specific areas, introduces the idea 
of using a client server model for distributing the knowledge among the agents, 
and the use of meta-knowledge for facilitating the understanding of the knowl- 
edge passed to interface agents. We are currently exploring several extensions 
to these ideas. One is the use of XML to facilitate the specification of rule-sets, 
user-preferences, communication events, and calendar information. The other is 
the ability to program the adapter agents so that they directly communicate with 
other adapter agents instead of having to communicate back to the PCA all the 
service events. This new effort is known as the IMPAX (Intelligent Messaging 
with Personal Agents and XML) project. 

Abstract. Genetic Algorithms (GAs) have traditionally been designed
to work on bitstrings. More recently interest has shifted to the applica- 
tion of GAs to constraint optimization and combinatorial optimization 
problems. Important for an effective and efficient search is the use of 
a suitable crossover operator. This paper analyses the performance of 
six existing crossover operators in the traveling salesman domain. While 
the edge recombination operator was reported to be the most suitable 
operator in the TSP domain, our results suggest that this is only true 
for symmetric TSPs. The problem with edge recombination is that it 
inverts edges found in the parents. This has no negative effect for the 
symmetric TSP but can have a substantial effect if the TSP is asym- 
metric. We propose an edge based crossover operator for the asymmetric 
TSP and demonstrate its superiority over the traditional edge recombi- 
nation. Another interesting finding is that order crossover (OX) which 
has an average performance for symmetric problems, performs very well 
on asymmetric problems. 



Topic Area: Evolutionary Computation, Genetic Algorithms, Search, Selection 
Strategies, Genetic Sequencing Operators, Constrained Ordering Problems, Com- 
binatorial Optimization. 

1 Introduction 

Genetic algorithms are a stochastic search method that can be used for real func- 
tion optimization as well as constraint optimization. GAs are algorithms that 
simulate the natural adaptive process through evolution on a specific artificial 
population of individuals which we call chromosomes. These chromosomes are 
encodings for a candidate solution of the problem at hand. Each such chromo- 
some has associated with it a fitness value which is usually closely related to the 
function that is to be optimized. Unlike other local search algorithms, which only
consider one candidate solution at a time and search in its local neighborhood
for improvements, a GA works on a whole population of candidate solutions and
searches in their hyper-neighborhood. This hyper-neighborhood search is done
by selecting subsets of the population (called the parents) for reproduction,
applying genetic operators to them to produce children, and inserting these children
into the next generation. The fitness value of each chromosome is used to
determine the probability of an individual being chosen for reproduction. Highly fit
individuals are more likely to be chosen for reproduction and thus have a higher
probability of passing on their "good" genetic material. Fig. 1 shows a
high-level standard GA.



Top Level Genetic Algorithm

Initialize a population of chromosomes;
Evaluate the chromosomes in the population;
while (stopping criteria not reached) do
    for i = 1 to sizeof(population)/2 do
        select 2 parent chromosomes;
        apply crossover operator to them;
        apply mutation operator to them;
        evaluate the new chromosomes;
        insert the new chromosomes into the next generation;
        i = i + 1;
    endfor
    update stopping criteria;
endwhile

Fig. 1. Top Level Genetic Algorithm
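
The loop in Fig. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the stopping criterion is a fixed number of generations, selection is roulette wheel on a fitness that must be positive, the population size is assumed to be even, and the bitstring operators `one_point` and `flip` as well as the fitness `onemax` are hypothetical placeholders.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def run_ga(init_population, fitness, crossover, mutate, generations, seed=0):
    """Generational GA following Fig. 1; stops after a fixed number of generations."""
    rng = random.Random(seed)
    population = list(init_population)
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        next_generation = []
        for _ in range(len(population) // 2):
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            c1, c2 = crossover(p1, p2, rng)
            next_generation.extend([mutate(c1, rng), mutate(c2, rng)])
        population = next_generation
    return max(population, key=fitness)


# Hypothetical toy problem: maximize the number of ones in a bitstring.
def onemax(bits):
    return 1 + sum(bits)  # the +1 keeps every roulette weight positive


def one_point(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def flip(bits, rng, pm=0.05):
    return [b ^ 1 if rng.random() < pm else b for b in bits]


rng0 = random.Random(42)
population = [[rng0.randint(0, 1) for _ in range(8)] for _ in range(10)]
best = run_ga(population, onemax, one_point, flip, generations=20, seed=1)
print(best, onemax(best))
```

For the TSP, the chromosomes would be permutations and the crossover and mutation functions would be replaced by sequencing operators that always return valid permutations.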



There are two fundamentally different classes of GAs: 

1. binary coded GAs 

2. real coded GAs 

Binary coded GAs use bitstrings to encode candidate solutions. This bitstring is 
fed into a decoder to determine the bitstring’s fitness. Real coded GAs can use 
numerical values, vectors, lists, arrays, ordered lists or whatever else one prefers
as a candidate solution representation. In this paper we use a real coded GA 
that uses permutations from 1 to n to represent solutions to a given TSP. 

This paper is organized as follows: We begin with a literature review and move on
to introduce a new sequencing operator in Sect. 3, asymmetric edge recombination
(ASERC), that passes on edges from the
parents to the offspring in such a way that the direction of the edge is preserved.
In Sect. 4 we compare the operators to each other under standard selection
(STDS), where parents are chosen by roulette wheel selection, and Keep-Best
Reproduction (KBR), where in addition to roulette wheel selection the group
of two parents and two offspring undergoes a selection process that keeps the
best parent and the best child [Wie98a]. This comparison also emphasizes a
comparison of ASERC to the traditional edge recombination operator (ERC),
which also passes on edges but might invert the direction of an edge. In Sect. 5
we briefly discuss an application study where the task is to find the most efficient
bus routes in a Manhattan grid. Section 6 contains our conclusions.
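
The replacement step of KBR as summarized above (keep the best parent and the best child) can be sketched as follows. The function name and the `cost` callable are hypothetical; lower cost is taken to mean a better tour, and details of the original procedure in [Wie98a] beyond the one-sentence summary are not modeled.

```python
def keep_best_reproduction(parents, offspring, cost):
    """Return the cheapest of the two parents and the cheapest of the two children."""
    best_parent = min(parents, key=cost)
    best_child = min(offspring, key=cost)
    return best_parent, best_child


# Hypothetical costs for two parents and their two children.
costs = {"p1": 40, "p2": 25, "c1": 30, "c2": 35}
survivors = keep_best_reproduction(["p1", "p2"], ["c1", "c2"], costs.get)
print(survivors)
```

Under this reading, one parent always survives into the next generation, which is what distinguishes KBR from the purely generational replacement of STDS.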



2 Review 



Originally, crossover operators in GAs manipulated binary strings by cutting 
the parent strings and recombining those partial strings into new offspring. Such 
recombining is not possible when the parents are permutations rather than bit- 
strings. This would lead to invalid permutations due to duplications and omis- 
sions. One way of dealing with this problem is the inversion operator. Inversion 
randomly chooses two positions in the permutation and inverts the order of the 
entries between those two positions. However, inversion is a unary operator that 
is not able to recombine useful information from two or more permutations. Be- 
fore we start to investigate binary genetic sequencing operators let us consider 
what kind of information is in a permutation that would be useful to process. 
There are mainly three kinds of information that can be found in a permutation: 

- absolute position
- relative order
- adjacency

To illustrate this, we use an example: 

Example 2.1 Assume we have the parent tour

3 4 1 5 6 2

and the child tour

1 4 3 2 5 6.

Without making any assumptions about the workings of the genetic sequencing
operator (GSO), we can make the following observations: The GSO has preserved
the absolute position of city 4. It has preserved the relative order of cities 4, 5,
and 6, and it has preserved the adjacencies 3-4, 4-1, and 5-6.
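
The three observations of Example 2.1 can be checked mechanically. The helper names below are hypothetical, and adjacency is treated as undirected and without wrapping from the last city back to the first, which is the reading that reproduces the three adjacencies listed in the example.

```python
def preserved_positions(parent, child):
    """Cities that occupy the same absolute position in both tours."""
    return [c for p, c in zip(parent, child) if p == c]


def preserved_adjacencies(parent, child):
    """Undirected neighbor pairs (no wrap-around) present in both tours."""
    def edges(tour):
        return {frozenset(e) for e in zip(tour, tour[1:])}
    common = edges(parent) & edges(child)
    return sorted(tuple(sorted(e)) for e in common)


def keeps_relative_order(parent, child, cities):
    """True if the given cities appear in the same order in both tours."""
    def order(tour):
        return [c for c in tour if c in cities]
    return order(parent) == order(child)


parent = [3, 4, 1, 5, 6, 2]
child = [1, 4, 3, 2, 5, 6]
print(preserved_positions(parent, child))
print(preserved_adjacencies(parent, child))
print(keeps_relative_order(parent, child, {4, 5, 6}))
```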

The first comparative study of GSOs was published by Oliver et al. [Oli87]. The
three operators investigated were: 1) Davis' Order Crossover (OX), 2) Goldberg's
PMX operator and 3) Cycle Crossover (CX), which was newly introduced by
the authors. Overall, the authors conclude the following operator ranking: OX
> PMX > CX, with ">" being the relation between GSOs denoting "better
performance than." The newly developed cycle crossover, which is a position
preserving operator, fails to find good results. This is an indication that absolute
position is not a meaningful search criterion for the TSP.

In addition to the three GSOs OX, PMX, and CX used in [Oli87], the authors of
[Sta91] use a (modified) edge recombination operator (ERC) and a modified OX (OX2)
as well as position based crossover (PBX), which, despite the name, is really just
another order operator. The authors introduce a modification to ERC which
enables ERC to preserve common subsequences in the parents and pass them
on to the offspring. All six operators are tested in two domains: 1) a 30 city
symmetric blind TSP and 2) a warehouse/shipping scheduler. The GENITOR
system was used for both domains and no mutations were performed. For the 30
city TSP the ranking of the six GSOs was as follows:

ERC > OX > OX2 > PBX > PMX > CX

Maekawa [Mae96] has studied the use of GAs to solve TSPs as well. His study
also showed that ERC was the superior operator.

In this paper, we compare the performance of OX, OX2, PMX, PBX, CX, and
ERC, as well as our asymmetric edge recombination operator ASERC. The nov- 
elty of this study is that selection strategies are allowed to vary and their influ- 
ence on the performance of the genetic sequencing operators is studied. 

3 Edge Recombination (ERC) and Asymmetric Edge 
Recombination (ASERC) 

In this section, we explain how ERC works and demonstrate that ERC is not 
an appropriate operator for non-symmetric problems. We illustrate how an op- 
erator for asymmetric problems can be constructed that preserves edges and 
their direction from parent tours. For the remainder of this paper, let P1 and
P2 denote the first and second parent and O1 and O2 denote the first and the
second offspring.

3.1 Edge Recombination Operator, ERC 

Originally proposed by D. Whitley et al. [Whi89a], ERC is different from the
above operators in the sense that it emphasizes and preserves edges (and thus
adjacency) rather than absolute position or relative order. First, a so-called "edge
table" is built that contains a row for each element c of the sequence. This row
contains all the elements that c has edges to. From this table two offspring tours
are constructed that contain only edges from the two parent tours. However,
this approach is problematic if the underlying problem is not symmetric. ERC
might invert an edge a-b found in a parent to become b-a in an offspring. In a
symmetric domain this is not a problem, but in an asymmetric domain this can
have a very negative influence on the solution quality.

3.2 Asymmetric Edge Recombination Operator, ASERC 

From the literature review in Sect. 2 we know that ERC performed very well
on symmetric TSPs when compared to other sequencing operators. It, however,
performed poorly on scheduling problems. It was argued by the authors of [Sta91]
that this under-performance of ERC on scheduling problems is due to the nature
of scheduling: relative order is more important in scheduling than is adjacency.
We also think that this is true, but believe that another factor contributed to
the poor performance of ERC. The fact that ERC inverts edges, because it
assumes that a-b has the same cost as b-a, might also contribute to the
under-performance in the scheduling domain, because it makes a difference whether or
not a is scheduled before b. Even worse, if we have an asymmetric TSP then the
cost of an inverted edge could be totally different from the original one. Most
real world problems are asymmetric. For example, commuting into or out of the
city during rush hour will take different amounts of time and gas consumption
will vary. Traveling uphill will require more energy than going downhill. In the
worst case, an inverted edge has a cost of ∞, because there might not be a direct
way back (e.g., one way traffic).
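
The effect of inverting edges can be seen on a very small instance. The sketch below uses a hypothetical three-city cost matrix, chosen only for illustration, in which one direction of travel does not exist; `tour_cost` follows every edge in the direction given by the tour and closes the tour back to its first city.

```python
def tour_cost(tour, cost):
    """Cost of the closed tour, following each edge in its given direction."""
    return sum(cost[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


INF = float("inf")

# Hypothetical asymmetric cost matrix; cost[a][b] is the cost of going a -> b.
cost = {
    "A": {"B": 2, "C": 9},
    "B": {"A": 7, "C": 3},
    "C": {"A": 4, "B": INF},  # no direct way from C to B (one way traffic)
}

forward = ["A", "B", "C"]
backward = forward[::-1]
print(tour_cost(forward, cost), tour_cost(backward, cost))
```

The forward tour uses the edges A-B, B-C, and C-A, while the inverted tour needs C-B, which does not exist, so its cost is infinite even though both tours visit the same neighbors.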

We do believe that adjacency is very important in the TSP, and therefore propose 
an edge preserving operator that preserves the edges and the direction of these 
edges. This operator can then be used to solve asymmetric TSPs. We call this 
operator asymmetric edge recombination or ASERC. 

ASERC proceeds as follows: 1) An edge table is built similar to the one for
ERC. However, care is taken that the link list for element a contains only
elements that have an incoming edge from element a in one of the parents.
Elements that have outgoing edges to a are not inserted into the link list for a.
If we consider the example sequence [3 2 1 4 5 6], then the link list for element
4 would contain 5 only, but not 1. 2) From this edge table an offspring tour is
constructed in a similar way as for ERC.

Let us look at an example: 

P1 = 1 2 3 4 5 6
P2 = 3 4 5 2 6 1

This produces the following edge table and offspring:

1:  2  3
2:  3  6
3:  4
4:  5
5:  6  2
6:  1

O1 = 1 3 4 5 2 6
O2 = 4 5 6 1 2 3

Note that either one or two entries are in each link list. In the case where only one
element is in the link list, this list represents a common edge. We do not need
to explicitly take care to preserve common subtours by flagging elements with
"-" signs in the edge table, as was the case for ERC. ASERC will automatically
preserve common subsequences found in the parents. For example, the common
subsequence [3 4 5] is preserved in the offspring as well as the subsequence
(directed edge) [6 1]. We conclude our discussion of ASERC with the following
remarks:

- ASERC preserves edges and their direction.

- No modifications to preserve common subsequences are needed. ASERC
automatically preserves those common subsequences. Unlike ERC, ASERC
does not invert the common subsequences or part thereof.

- ASERC is specifically designed to work on asymmetric problems. Technically,
it does, however, also work on symmetric ones.
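
The directed edge table of step 1) can be sketched in a few lines. The function names are hypothetical, tours are treated as closed (the last city links back to the first), and the offspring construction of step 2) is deliberately left out because its details are not spelled out here; instead, a small check confirms that a given offspring uses only directed parent edges.

```python
def directed_edges(tour):
    """Directed edges of the closed tour, including last -> first."""
    return list(zip(tour, tour[1:] + tour[:1]))


def directed_edge_table(p1, p2):
    """Link list per city: successors of that city in either parent."""
    table = {city: [] for city in p1}
    for parent in (p1, p2):
        for a, b in directed_edges(parent):
            if b not in table[a]:
                table[a].append(b)
    return table


def uses_only_parent_edges(offspring, table):
    """True if every directed edge of the offspring appears in the table."""
    return all(b in table[a] for a, b in directed_edges(offspring))


p1 = [1, 2, 3, 4, 5, 6]
p2 = [3, 4, 5, 2, 6, 1]
table = directed_edge_table(p1, p2)
for city, links in table.items():
    print(city, links)

o1 = [1, 3, 4, 5, 2, 6]
o2 = [4, 5, 6, 1, 2, 3]
print(uses_only_parent_edges(o1, table), uses_only_parent_edges(o2, table))
print(uses_only_parent_edges(o1[::-1], table))
```

The last line illustrates the point of the operator: the reversed tour has the same neighbors as O1 but none of its directed edges are guaranteed to come from a parent.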

4 Empirical Comparison of the Sequencing Operators 

In this section, we compare PMX, OX, OX2, PBX, CX, SYMERC, and ASERC
on a 100 city asymmetric problem. The operators are tested under STDS and 
KBR. We rank the operators for each selection strategy and also rank the selec- 
tion strategies with respect to each operator. The first part of the comparison 
does not use mutation, so that we can compare the GSOs in isolation. Later, 
mutation is added to achieve the best possible results. 

Pc denotes the probability that two chromosomes chosen for reproduction will
actually undergo crossover (Pm is defined similarly for mutation). Ten different
crossover probabilities Pc and ten different mutation probabilities Pm are tested.
The population size was 600 and the number of generations was set to 600. All
results are averages over 20 independent runs with different random seeds.

4.1 Recombination Alone 

The following is a ranking of the different GSOs under STDS and KBR: 

- STDS: OX > PMX > CX > ASERC > ERC > PBX > OX2

- KBR : OX > ASERC > PMX > PBX > OX2 > ERC > CX

This is consistent with the findings of Oliver et al. [Oli87], who found the following
ranking: OX > PMX > CX. Our results differ slightly from Whitley's findings
[Whi89a], which gave the following ranking: ERC > OX > OX2 > PBX > PMX >
CX. While Whitley ranks ERC the highest, the operator showed a rather poor
performance in our tests. This is mainly due to our problems being asymmetric.
Also, Whitley's algorithm differs from ours by using a steady state replacement
technique, while we use generational replacement.

Table 1. Cost of cheapest solution found, recombination only. i is the number
of generations until the population has converged, i.e., all fitness values are the
same

                STDS                     KBR
GSO       cost    Pc      i        cost    Pc      i
PMX     151243   0.6    500      211290   1.0     80
OX      121315   0.2   600+       64606   0.9    600
OX2     228343   0.6   600+      229878   0.8    120
PBX     223405   0.5    540      229417   0.9    120
CX      187382   1.0    180      319107   1.0     40
ERC     196999   0.5   600+      262936   1.0    120
ASERC   195929   0.3   600+      155753   1.0    180



While the newly proposed GSO ASERC does not show any significant improve- 
ment over ERC when STDS is used, it shows a substantial performance im- 
provement with KBR. With KBR, ASERC ranks second while ERC ranks sixth. 
However, for both STDS and KBR the operator with the overall best perfor- 
mance is OX. 

4.2 Recombination and Mutation 

The addition of mutation to the GA will change the GA’s behavior substantially. 
In this subsection we have run tests for all GSOs and have allowed Pc and Pm 
to vary over a wide range. 

PMX

The best results are found with Pc ≤ 0.6. Increasing Pc worsens the tours found
by PMX. KBR finds excellent tours for all settings of Pc. Overall, KBR finds
much better tours than STDS when PMX is used.

OX

OX yields the best tours for Pc ≤ 0.3 for STDS. Increasing Pc beyond 0.4 leads
to very bad tours. The best tours are found with very low mutation probabilities,
in most cases Pm = 0.00. This is not surprising, since OX is a very disruptive
operator. Adding more disruption with mutation will only worsen the
performance of STDS. As expected, OX's performance increases in proportion to Pc
for KBR. The best tour is found for Pc = 1.0. It is just about half as expensive
as the cheapest tour that STDS can find.

OX2

OX2 does not deliver good results for STDS. The cheapest tour has cost 210,098.
Compared with the other GSOs this is the second worst result. Increasing Pc
beyond 0.5 dramatically worsens the tours. When KBR is used, OX2 does much
better, also when compared to the other GSOs. This again suggests that KBR
with higher mutation rates can lead to substantial performance increases, even
if the underlying GSO performs rather poorly under STDS.

Table 2. STDS, recombination and mutation. Cost of the cheapest tour found
after i generations, using different genetic sequencing operators and different
settings for Pm and Pc

                        Genetic Sequencing Operator
Pc            PMX      OX     OX2     PBX      CX     ERC   ASERC
0.1  cost  132902  135823  219811  214972  144108  231341  173065
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.09    0.02    0.09    0.07    0.10    0.07    0.03
0.2  cost  129518  120329  215169  217797  143023  234493  185428
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.09    0.01    0.05    0.02    0.09    0.03    0.03
0.3  cost  126632  124288  214482  221521  139382  220084  190951
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.08    0.00    0.04    0.03    0.08    0.04    0.04
0.4  cost  129500  139666  211456  216888  137139  206488  214128
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.09    0.00    0.03    0.03    0.10    0.09    0.00
0.5  cost  128546  217274  210098  211025  138137  177846  298528
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.09    0.00    0.02    0.01    0.07    0.04    0.02
0.6  cost  127652  296224  228343  247220  135158  205672  341104
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.06    0.06    0.00    0.00    0.10    0.00    0.00
0.7  cost  168065  329637  361301  358251  132692  222176  359235
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.03    0.07    0.01    0.02    0.08    0.00    0.01
0.8  cost  352066  343763  378369  376982  130883  234668  366308
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.01    0.00    0.00    0.05    0.10    0.02    0.01
0.9  cost  376666  355980  384810  384599  128849  245013  369925
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.02    0.00    0.04    0.05    0.10    0.02    0.04
1.0  cost  379283  363617  388872  388175  125010  255354  377266
     i       600+    600+    600+    600+    600+    600+    600+
     Pm      0.05    0.00    0.01    0.00    0.10    0.03    0.03

Table 3. KBR, recombination and mutation. Cost of the cheapest tour found
after i generations, using different genetic sequencing operators and different
settings for Pm and Pc

                        Genetic Sequencing Operator
Pc            PMX      OX     OX2     PBX      CX     ERC   ASERC
0.1  cost   86770   87294   90993   91637   86769   91915   87646
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.7     0.8     0.4     0.4     1.0     0.7     0.5
0.2  cost   88151   84789   86793   91852   89005   91268   88106
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.6     0.4     0.4     0.5     1.0     0.5     0.7
0.3  cost   87615   82443   91473   92113   87150   90511   85200
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.8     0.5     0.7     0.4     0.9     0.7     0.5
0.4  cost   85541   78916   91749   92548   87340   88940   87067
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.5     0.4     0.5     0.4     0.9     0.3     0.4
0.5  cost   86023   78408   93159   93796   87260   88284   84166
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.8     0.4     0.3     0.4     0.9     0.4     0.4
0.6  cost   86490   72557   91251   90501   85738   89760   83836
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.8     0.1     0.4     0.4     0.8     0.4     0.3
0.7  cost   87622   68282   93433   93441   87800   88705   84605
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       1.0     0.3     0.3     0.3     0.6     0.5     0.6
0.8  cost   87575   63471   93333   94840   89337   89379   81726
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.7     0.1     0.4     0.4     1.0     0.3     0.5
0.9  cost   87673   64035   97384   95601   86656   86362   81359
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.5     0.1     0.4     0.3     1.0     0.5     0.3
1.0  cost   87898   61129   97537   96103   88631   88341   82740
     i       600+    600+    600+    600+    600+    600+    600+
     Pm       0.8     0.1     0.3     0.4     0.5     0.5     0.4

PBX

As in the case where only recombination was used, PBX behaves very similarly
to OX2. The results are generally poor for STDS, with the best tour found
for Pc = 0.5. PBX has the worst solution quality of all GSOs for STDS. As
for OX2, increasing Pc dramatically worsens the tours. When KBR is used, the
results are much better than with STDS, but still, PBX delivers the worst results
of all GSOs.

CX

As we mentioned in Subsect. 4.1, CX behaves very differently from the other
GSOs. First, when STDS is used, the solution quality increases in proportion to
the crossover probability Pc. This is unusual, since usually too high a crossover
probability is too disruptive for STDS to produce good overall performance.
However, as we noted before, CX does not seem to be a very disruptive operator.
This notion is also supported by the fact that CX benefits from higher mutation
rates than the other GSOs. Again, KBR substantially improves the results of
CX. Also interesting is the fact that, with KBR, the best results are achieved
with high mutation probabilities.

ERC

Under STDS, ERC delivers poor results as well. The overall best tour is not as
bad as with OX2 and PBX, but much worse than with PMX, OX, and CX. Increasing
Pc improves the quality of the tours up to Pc = 0.5. Further increasing Pc
worsens the tours. With KBR, the results are again significantly better, and the
best result with KBR is achieved with a crossover probability of Pc = 0.9.

ASERC

ASERC does not perform very well under STDS, although its performance is
better than that of ERC. Interesting is the fact that ASERC finds the best
solution for very low settings of Pc. In fact, for the 100 city problem the best
solution was found with Pc = 0.1. Again, the use of KBR leads to great
improvements in solution quality. In fact, ASERC is the second best GSO under KBR.
Only OX finds better solutions.



Table 4. Cost of cheapest solution found, recombination and mutation. The
values for Pc and Pm are the ones yielding the best results.

                   STDS                            KBR
GSO       cost    Pc    Pm      i        cost    Pc    Pm      i
PMX     127652   0.6  0.06   600+       85541   0.4   0.5   600+
OX      120329   0.2  0.01   600+       61129   1.0   0.1   600+
OX2     210098   0.5  0.02   600+       86793   0.2   0.4   600+
PBX     211025   0.5  0.01   600+       90501   0.6   0.4   600+
CX      125010   1.0   0.1   600+       85738   0.6   0.8   600+
ERC     177849   0.5  0.04   600+       86362   0.9   0.5   600+
ASERC   173065   0.1  0.03   600+       81359   0.9   0.3   600+

The following is the overall ranking of the genetic sequencing operators for STDS
and KBR using optimal parameter settings for both Pc and Pm.

- STDS: OX > CX > PMX > ASERC > ERC > OX2 > PBX

- KBR : OX > ASERC > PMX > CX > ERC > OX2 > PBX
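
These two rankings follow directly from the cost columns of Table 4, where a lower cost means a better operator. The short check below copies those costs into two dictionaries and sorts the operators by cost; the helper `rank` is a name introduced here for illustration.

```python
# Cheapest tour cost per operator, copied from Table 4.
stds = {"PMX": 127652, "OX": 120329, "OX2": 210098, "PBX": 211025,
        "CX": 125010, "ERC": 177849, "ASERC": 173065}
kbr = {"PMX": 85541, "OX": 61129, "OX2": 86793, "PBX": 90501,
       "CX": 85738, "ERC": 86362, "ASERC": 81359}


def rank(costs):
    """Operators ordered from cheapest to most expensive best tour."""
    return sorted(costs, key=costs.get)


print("STDS:", " > ".join(rank(stds)))
print("KBR :", " > ".join(rank(kbr)))

# Spread between the worst and best operator under each selection strategy.
print(max(stds.values()) / min(stds.values()))
print(max(kbr.values()) / min(kbr.values()))
```

The spread between the best and worst operator is also visibly smaller under KBR than under STDS.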



5 An Application: Bus Routing (BR) 

So far, we have studied the asymmetric TSP in general, which is the mathe- 
matical representation of many real world problems including bus routing (BR). 
During the course of my program at the University of Regina, I had the
opportunity to mentor a student on a project involving finding cost effective bus routes
using GAs [Dar99]. This section focuses on the bus routing problem (BRP) in
more detail, shows the similarities and differences between the BRP and our
random asymmetric TSPs, and discusses the results of [Dar99].

The BRP investigated in [Dar99] routes a bus through a set of stops with
weighted stop times, i.e., the time the bus waits at each stop is added to the 
total tour cost. The different weights can be used to simulate longer stop times 
during peak traffic hours. This is an augmentation of an instantiation of the 
symmetric TSP. In addition, a Manhattan city block distance metric is used in- 
stead of the Euclidean distance metric that is usually used for symmetric TSPs. 
This of course makes sense for many North American city cores. For the BRP, 
Darrah used the following operators: 

- Selection: STDS and KBR

- Crossover: OX2, ERC, and ASERC

- Mutation: SUBLIST and SWAP

SUBLIST mutation works as follows: 1) Randomly choose two positions in the 
permutation. 2) Permute the elements in the sublist between the two positions 
chosen in step 1). Darrah noticed that SUBLIST mutation was too disruptive 
for his problems and did not include it in his final results. 
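
The two mutation operators can be sketched as follows. For SUBLIST, "permute" is interpreted here as a uniform random shuffle of the chosen sublist, which is an assumption. SWAP is not defined in the text; the sketch assumes the common meaning of exchanging two randomly chosen elements. Both function names are hypothetical.

```python
import random


def sublist_mutation(tour, rng):
    """Shuffle the elements between two randomly chosen cut positions."""
    i, j = sorted(rng.sample(range(len(tour) + 1), 2))
    middle = tour[i:j]
    rng.shuffle(middle)
    return tour[:i] + middle + tour[j:]


def swap_mutation(tour, rng):
    """Exchange two randomly chosen elements (assumed meaning of SWAP)."""
    i, j = rng.sample(range(len(tour)), 2)
    mutated = list(tour)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(0)
tour = [1, 2, 3, 4, 5, 6]
print(sublist_mutation(tour, rng))
print(swap_mutation(tour, rng))
```

A swap changes at most four edges of a tour, whereas shuffling a long sublist can change many edges at once, which is in line with the observation that SUBLIST was too disruptive.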

Darrah found that KBR was the better selection strategy, yielding superior
results in all cases. He also noted that KBR works best with high settings
for Pc and Pm and needs fewer generations to converge than STDS does. The
GSO that yielded the overall best performance was ERC. It should not come as
a surprise that ERC outperformed ASERC on the bus routing task: Darrah's
distance metric is the Manhattan distance without directional constraints, i.e.,
the problem is symmetric.
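
A minimal sketch of such a tour cost, under stated assumptions, is given below. The function names, the closed-tour convention, and the way a weight multiplies a stop's waiting time are all hypothetical choices made for illustration; the source only says that weighted stop times are added to the tour cost and that the Manhattan metric is used.

```python
def manhattan(a, b):
    """City block distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def tour_cost(stops, wait, weight, order):
    """Travel distance of the closed tour plus weighted stop times."""
    travel = sum(
        manhattan(stops[order[k]], stops[order[(k + 1) % len(order)]])
        for k in range(len(order))
    )
    # Assumption: each stop contributes weight * waiting time.
    waiting = sum(weight[s] * wait[s] for s in order)
    return travel + waiting
```

Since the Manhattan distance is the same in both directions, a tour and its reverse have the same cost under this sketch, which is the symmetry the text refers to.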

6 Conclusion 

The newly proposed ASERC operator ranks second best for KBR and also shows 
better performance than ERC under STDS. The overall best operator was OX. 
KBR delivered the overall best results for all GSOs. Interestingly, while
performance varied widely between the different GSOs under STDS, the results
were much closer together under KBR. This suggests that KBR is more robust
not only with regard to population size and operator probabilities, but also
with regard to the sequencing operator used for crossover.
The newly proposed ASERC performed very well in comparison to the other
GSOs; only OX managed to outperform it. ASERC found better tours than
the other adjacency-based crossover (ERC) and is thus an improved
adjacency-based operator for asymmetric TSPs. When the problem is symmetric,
as the bus routing problem from the previous section is, ERC should be used
since it outperforms ASERC. OX's superior performance is itself an interesting
result. While OX is outperformed by ERC on symmetric problems, our results suggest
that OX is a very good operator for asymmetric TSPs.

Abstract. We introduce an interactive visualization system, CViz, for
rule induction. CViz visualizes the process of learning classification rules,
which consists of five components: preparing and visualizing the original
data, cleaning the original data, discretizing numerical attributes, learning
classification rules, and visualizing the discovered rules. The CViz
system is presented and each component is discussed. Three approaches
for discretizing numerical attributes, including equal-length, equal-depth,
and entropy-based approaches, are provided. The algorithm ELEM2 for
learning classification rules is introduced, and approaches to visualizing
discretized data and classification rules are proposed. CViz could be
easily adapted to visualize the rule induction process of other rule-based
learning systems. Our experimental results on the IRIS data, Monks
data, and artificial data show that the CViz system is useful
for visualizing and understanding the learning process of classification
rules.

Keywords: Interactive visualization, knowledge discovery, machine learn- 
ing, rule induction. 



1 Introduction 

Interactive visualization techniques allow people to visualize the results of pre- 
sentations on the fly in different perspectives, and thus help users understand 
the discovered knowledge better and more easily. This interactive process makes 
the knowledge discovery process straightforward and accessible. 

Many techniques and systems for data visualization have been developed and 
implemented [4, 7, 8, 11]. One common feature of these business systems is their
dependence on computer graphics and scientific visualization. Most existing vi- 
sualization systems lack the ability to visualize the entire process of knowledge 
discovery. The complex data is carefully arranged and displayed in a specific 
visual form, and the knowledge underlying the original data is left to the users 
who must observe and determine the meaning of the pictures. This determination 
usually requires a wealth of background knowledge. Silicon Graphics developed
a series of visualizers, such as Map Visualizer and Tree Visualizer [7], to
visualize the knowledge discovered by different techniques such as decision
trees and neural networks, but only the results are displayed. Interactive visual
knowledge discovery should provide a user with not only the discovery results
but also the entire process in a visual form so that the user can participate in
the discovery process and understand the knowledge discovered.

The CViz system stresses the visualization of classification rule discovery.
It consists of five components: original data visualization on a parallel
coordinates system, interactive data reduction (horizontally and vertically)
by removing some attributes and/or attribute values, numerical attribute
discretization and visualization with three approaches, rule induction
using the learning algorithm ELEM2 [2], and rule visualization.

The framework of the CViz system and its components are presented in 
Section 2. Three approaches to discretizing numerical attributes are discussed in 
Section 3. In Section 4, the learning algorithm ELEM2 used in the CViz system 
is introduced. The rules discovered by ELEM2 can be displayed on screen in 
visual forms, which is the topic of Section 5. In Section 6, the experimental
results with artificial data and UCI data are shown. Related work is outlined in
Section 7, and Section 8 gives concluding remarks.



2 The CViz System 

CViz is an interactive knowledge discovery visualization system, which uses in- 
teractive visualization techniques to visualize the original data, help the user 
clean and preprocess the data, and interpret the rules discovered. CViz also con- 
ducts numerical attribute discretization and rule induction to learn classification 
rules based on the training data set. CViz consists of five components, shown in
Fig. 1.

In CViz, the original data is visualized using the parallel coordinates
technique. Suppose the training data are represented as n-ary tuples. Then n
equidistant axes are created parallel to one of the screen axes, say the Y-axis,
and correspond to the attributes. The axes are scaled to the range of the
corresponding attributes and normalized, if necessary. Every tuple corresponds
to a polyline which intersects each of the axes at the point that corresponds to
the tuple's value for that attribute. Fig. 5 shows the effect of using the
parallel coordinates technique to visualize multidimensional data sets. Each
coordinate is annotated with a list of values for a categorical attribute, while
a numeric attribute can be annotated with its minimum and maximum, and even
its mean, variance, etc. As the algorithm proceeds, numeric intervals can be
added, including the start point and the end point of each interval and the
total number of intervals, once the numeric attribute is discretized.
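
The mapping from a tuple to its polyline can be sketched as follows. This is a minimal illustration, not the CViz implementation: the function name, the screen conventions (vertical axes, y growing downward), and the assumption that categorical values have already been encoded as numbers are all hypothetical.

```python
def polyline(tuple_values, ranges, width, height):
    """Screen points where one tuple crosses n vertical, equidistant axes.

    `ranges` holds one (minimum, maximum) pair per attribute; at least
    two attributes are assumed.
    """
    n = len(tuple_values)
    spacing = width / (n - 1)          # distance between neighbouring axes
    points = []
    for i, (v, (lo, hi)) in enumerate(zip(tuple_values, ranges)):
        # Normalise the value to [0, 1] along its own axis.
        t = (v - lo) / (hi - lo) if hi > lo else 0.5
        # Screen y grows downward, so the maximum sits at the top.
        points.append((i * spacing, height * (1.0 - t)))
    return points
```

For instance, a tuple whose values lie in the middle of every attribute range yields a horizontal line at half the window height.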

The second component is to reduce the original data, if necessary. Once the
original data is visualized on the parallel coordinates, the user can get a rough
idea of the data distribution so that data cleaning can be performed.
The data reduction involves two aspects: horizontal and vertical reduction. An
attribute that the user thinks is irrelevant to the knowledge discovery can be
interactively removed with a delete click, so that the data is horizontally
reduced. An attribute value can be deleted if the distribution of the data tuples
passing through the value is too sparse; the data tuples that have the value can
then be removed, so that the data is vertically reduced. The data reduction is
highly dependent on the user [8].

Fig. 1. The CViz System
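
The two kinds of reduction can be sketched as below. This is an illustrative sketch only: in CViz the user decides interactively what to remove, whereas here a hypothetical `min_count` threshold stands in for the user's judgment that a value is too sparse, and tuples are assumed to be dictionaries keyed by attribute name.

```python
from collections import Counter

def reduce_horizontally(rows, attribute):
    """Drop one attribute (column) from every tuple."""
    return [{k: v for k, v in row.items() if k != attribute} for row in rows]

def reduce_vertically(rows, attribute, min_count):
    """Drop tuples whose value for `attribute` occurs fewer than
    `min_count` times (hypothetical stand-in for the user's choice)."""
    counts = Counter(row[attribute] for row in rows)
    return [row for row in rows if counts[row[attribute]] >= min_count]
```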

The third component is to discretize numerical (continuous) attributes. The
CViz system provides three approaches: equi-length, equi-depth, and
entropy-based methods. The user can select the preferred method, or select
all of them in different learning runs to compare the learning results.
The discretized attributes are visualized again, where an attribute interval
corresponds to a point on the attribute coordinate. Fig. 6 shows the discretized
parallel coordinates.

Rule induction and visualization are the last two components. The ELEM2
learning algorithm is used to learn classification rules [2], and the final rules
are visualized as colored strips (polygons). Each rule is illustrated by a subset of
coordinates with corresponding values and/or intervals. The coordinates in the
subset are connected together through the values or intervals. The rule accuracy
and quality values are used to render the rule strips. Fig. 7 through Fig. 12
illustrate some of the classification rules obtained in our experiments.

3 Discretizing Numerical Attributes 

The CViz system provides three methods to discretize numerical attributes:
the equi-length, equi-depth, and entropy-based approaches.

The equi-length approach partitions the continuous domain into intervals of
equal length. For example, if the domain of attribute age is [0, 100], then it
can be divided into intervals of length 10, giving the age intervals
[0, 10], (10, 20], (20, 30], ..., (90, 100]. This approach is easy to implement.
Its main drawback is that many useful rules may be missed, since the distribution
of the data values is not considered.
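
The equi-length scheme is simple enough to sketch in a few lines. The function names below are hypothetical; the interval convention (first interval closed on both sides, the rest open on the left) follows the age example above.

```python
def equi_length_cuts(lo, hi, width):
    """Upper boundaries of intervals of equal `width` covering [lo, hi]."""
    cuts = []
    edge = lo + width
    while edge < hi:
        cuts.append(edge)
        edge += width
    cuts.append(hi)          # the last interval always ends at the maximum
    return cuts

def interval_index(value, lo, cuts):
    """Index of the interval containing `value`.

    Interval 0 is [lo, cuts[0]]; interval i > 0 is (cuts[i-1], cuts[i]].
    """
    if value < lo or value > cuts[-1]:
        raise ValueError("value outside the attribute domain")
    for i, cut in enumerate(cuts):
        if value <= cut:
            return i
```

With the domain [0, 100] and width 10, age 10 falls into the first interval and age 11 into the second, regardless of how many tuples each interval ends up holding.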

The second discretization approach used in CViz is a bin-packing based
equi-depth approach, which differs from existing equi-depth approaches. The
domain of a numerical attribute may contain an infinite number of points. To
deal with this problem, KIDS employs an adjustable buckets method, while
another existing approach is based on the concept of a partial completeness
measure. The drawback of these approaches is time-consuming computation
and/or large storage requirements. CViz uses a simple and direct method,
which is described in Fig. 2.

Assume the window size used to visualize the data set is M (width or height)
in pixels, and each pixel corresponds to a bin. Thus we have M bins, denoted
B[i], i = 0, ..., M - 1. We map the raw data tuples to the bins by means of the
mapping function. Suppose B[i] contains T[i] tuples, and further, the attribute
is to be discretized into N buckets. According to the equi-depth approach, each
bucket will contain $d = \frac{1}{N}\sum_{i=0}^{M-1} T[i]$ tuples. We first
assign B[0], B[1], ... to the first bucket until it contains at least d tuples,
and then assign the following bins to the second bucket. We repeat this process
until all buckets contain a roughly equal number of tuples.



Procedure for bin-packing discretization

j = 0;
for (i = 0; i < N; i++)
    Bucket[i] = 0;
    for (k = j; k < M; k++)
        Bucket[i] += T[j++];
        if Bucket[i] >= d
            break;

Fig. 2. Bin-packing based Equi-depth Discretization



The storage requirement of this approach is O(M + N), depending on the
number of buckets and the size of the visualization window, regardless of the 
domain of the attributes and the size of the data set. This approach does not 
need to sort the data and the execution time is linear in the size of the data set. 
This method, however, may not produce enough buckets, because each bin must 
be assigned to only one bucket, and cannot be further divided. For instance, if 
the data concentrates in several bins, then the buckets that contain these bins 
will contain many more tuples than others. This case could happen especially 
when the visualization window has a small size. 
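
A runnable sketch of the bin-packing idea is given below. It is not the CViz code: the linear value-to-bin mapping, the function name, and the choice to report each bucket as a range of bin indices are assumptions made for illustration.

```python
def bin_packing_equi_depth(values, lo, hi, m_bins, n_buckets):
    """Group M pixel bins into at most N buckets of roughly equal depth.

    Returns a list of (first_bin, last_bin, tuple_count) triples.
    """
    # Map each raw value to a bin; a linear mapping is assumed here.
    counts = [0] * m_bins
    for v in values:
        b = min(int((v - lo) * m_bins / (hi - lo)), m_bins - 1)
        counts[b] += 1
    depth = len(values) / n_buckets      # target number of tuples, d
    buckets = []
    j = 0
    for _ in range(n_buckets):
        if j >= m_bins:
            break                        # no bins left: fewer buckets result
        start, total = j, 0
        while j < m_bins:
            total += counts[j]
            j += 1
            if total >= depth:
                break
        buckets.append((start, j - 1, total))
    if buckets and j < m_bins:           # trailing bins join the last bucket
        s, _, t = buckets[-1]
        buckets[-1] = (s, m_bins - 1, t + sum(counts[j:]))
    return buckets
```

Only the M bin counters and the bucket list are stored, and the data is scanned once without sorting. When most tuples fall into one bin, that bin cannot be split, so the result has fewer and less balanced buckets than requested, which is the drawback noted above.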

The third approach uses the EDA-DB (Entropy-based Discretization According
to Distribution of Boundary points) method [3]. Unlike purely entropy-based
discretization methods (such as the method based on the Minimum Description
Length Principle), EDA-DB first divides the value range of the attribute into
several big intervals and then selects in each interval a number of cut-points 
based on the entropy calculated over the current entire data set. The number of 
cut-points selected for each interval is determined by estimating the probability 
distribution of the boundary points over the data set. The maximum number of 
selected cut-points is determined by the number of class labels and the number 
of distinct observed values for the continuous attribute. 

Let l be the number of distinct observed values for a continuous attribute A,
b be the total number of boundary points for A, and k be the number of classes in
the data set. Then the discretization of A is described in the procedure illustrated
in Fig. 3.



Procedure for EDA-DB discretization

1. Calculate: $m = \max\{2, k \cdot \log_2(l)\}$.

2. Estimate the probability distribution of boundary points:

(a) Divide the value range of A into d intervals, where $d = \max\{1, \log_e(l)\}$.

(b) Calculate the number $b_i$ of boundary points in each interval $iv_i$, where
$i = 1, 2, \ldots, d$ and $\sum_{i=1}^{d} b_i = b$.

(c) Estimate the probability of boundary points in each interval $iv_i$
($i = 1, 2, \ldots, d$) as $p_i = b_i / b$.

3. Calculate the quota $q_i$ of cut-points for each interval $iv_i$
($i = 1, 2, \ldots, d$) according to m and the distribution of boundary points:
$q_i = p_i \cdot m$.

4. Rank the boundary points in each interval $iv_i$ ($i = 1, 2, \ldots, d$) by
increasing order of the class information entropy of the partition induced by
the boundary point. The entropy for each point is calculated globally over the
entire data set.

5. For each interval $iv_i$ ($i = 1, 2, \ldots, d$), select the first $q_i$ points
in the above ordered sequence. A total of m cut-points are selected.

Fig. 3. EDA-DB discretization of continuous attribute
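
The procedure can be approximated in Python as follows. This is a rough sketch under several assumptions that the text does not settle: boundary points are taken as midpoints between adjacent distinct values whose class sets differ or are mixed, the logarithm in step 2(a) is read as the natural logarithm, m, d and the quotas are rounded to integers, and the d intervals have equal width. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def boundary_points(values, labels):
    """Midpoints between adjacent distinct values with differing or mixed classes."""
    classes = defaultdict(set)
    for v, y in zip(values, labels):
        classes[v].add(y)
    distinct = sorted(classes)
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])
            if len(classes[a] | classes[b]) > 1]

def partition_entropy(cut, values, labels):
    """Class information entropy of the split at `cut`, over the entire data set."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def eda_db(values, labels):
    l, k = len(set(values)), len(set(labels))
    points = boundary_points(values, labels)
    b = len(points)
    if b == 0:
        return []
    m = max(2, round(k * math.log2(l)))              # step 1 (rounded)
    d = max(1, round(math.log(l)))                   # step 2(a), natural log assumed
    lo, hi = min(values), max(values)
    width = (hi - lo) / d
    edges = [lo + i * width for i in range(d)] + [hi]
    cuts = []
    for i in range(d):
        inside = [p for p in points if edges[i] < p <= edges[i + 1]]     # step 2(b)
        quota = round(len(inside) / b * m)                               # steps 2(c), 3
        inside.sort(key=lambda p: partition_entropy(p, values, labels))  # step 4
        cuts.extend(inside[:quota])                                      # step 5
    return sorted(cuts)
```

In this sketch an interval that holds few boundary points simply contributes fewer cut-points than its quota allows, so fewer than m cut-points may be returned on small data sets.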



4 Learning Classification Rules

The CViz system uses ELEM2 as its approach to discovering knowledge.
ELEM2 is a rule induction system that learns classification rules from a set
of data [2]. Given a set of training data, ELEM2 sequentially learns a set of
rules for each of the classes in the data set. To induce rules for a class C,
ELEM2 conducts a general-to-specific heuristic search over a hypothesis space to
generate a disjunctive set of propositional rules. ELEM2 uses a sequential
covering learning strategy: it reduces the problem of learning a disjunctive set
of rules to a sequence of simpler problems, each requiring that a single
conjunctive rule be learned that covers a subset of positive examples. The
learning of a single conjunctive rule begins by considering the most general
rule precondition (the empty test that matches every training example) and then
greedily searching for the attribute-value pair that is most relevant to the
class label C according to the following attribute-value pair evaluation
function:

$SIG_C(av) = P(av)\,(P(C|av) - P(C))$

where av is an attribute-value pair and P denotes probability [2]. The selected
attribute-value pair is then added to the rule precondition as a conjunct. The
process is repeated by greedily adding a second attribute-value pair, and so on, 
until the hypothesis reaches an acceptable level of performance. In ELEM2, the 
acceptable level is based on the consistency of the rule: it forms a rule that is 
as consistent with the training data as possible. Since this consistent rule may 
be a small disjunct that overfits the training data, ELEM2 may post-prune the 
rule after the initial search for this rule is complete. 
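
The evaluation function can be estimated from relative frequencies, as in the sketch below. The data layout (a list of dictionaries with a `"class"` key) and the function name are assumptions for illustration and are not taken from ELEM2 itself.

```python
def sig(examples, target_class, attribute, value):
    """SIG_C(av) = P(av) * (P(C|av) - P(C)), estimated by relative frequencies."""
    n = len(examples)
    covered = [e for e in examples if e[attribute] == value]
    if not covered:
        return 0.0
    p_av = len(covered) / n
    p_c = sum(e["class"] == target_class for e in examples) / n
    p_c_given_av = sum(e["class"] == target_class for e in covered) / len(covered)
    return p_av * (p_c_given_av - p_c)
```

A pair that covers many examples and raises the probability of class C well above its prior gets a large positive score; a pair that lowers it gets a negative score, so the greedy step would pick the pair with the highest value.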

To post-prune a rule, ELEM2 first computes a rule quality value according to 
a formula that measures the extent to which a rule R can discriminate between 
the positive and negative examples of class label C: 

$$Q(R) = \log \frac{P(R|C)\,\bigl(1 - P(R|\bar{C})\bigr)}{P(R|\bar{C})\,\bigl(1 - P(R|C)\bigr)}$$

ELEM2 then checks each attribute-value pair in the rule in the reverse order 
in which they were selected to see if removal of the attribute-value pair will 
decrease the rule quality value. If not, the attribute-value pair is removed and 
the procedure checks all the other pairs in the same order again using the new 
rule quality value resulting from the removal of that attribute-value pair to see
whether another attribute-value pair can be removed. This procedure continues
until no pair can be removed. 
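
The pruning loop can be sketched as follows. This is an illustration rather than ELEM2's implementation: the quality function is passed in as a parameter, and the default shown here assumes a log odds ratio form with a 0.5 smoothing correction (added only to avoid division by zero). Keeping at least one pair in the rule is also an assumption.

```python
import math

def covers(rule, example):
    return all(example[a] == v for a, v in rule)

def quality(rule, examples, target):
    """Log odds ratio of covering positives versus negatives (smoothed; assumed form)."""
    pos = [e for e in examples if e["class"] == target]
    neg = [e for e in examples if e["class"] != target]
    p = (sum(covers(rule, e) for e in pos) + 0.5) / (len(pos) + 1)   # approx. P(R|C)
    q = (sum(covers(rule, e) for e in neg) + 0.5) / (len(neg) + 1)   # approx. P(R|not C)
    return math.log(p * (1 - q) / (q * (1 - p)))

def post_prune(rule, examples, target, quality=quality):
    rule = list(rule)                    # pairs in the order they were selected
    current = quality(rule, examples, target)
    removed = True
    while removed and len(rule) > 1:     # assumption: never empty the rule
        removed = False
        for pair in reversed(rule):      # most recently selected pair first
            candidate = [p for p in rule if p != pair]
            new = quality(candidate, examples, target)
            if new >= current:           # removal does not decrease the quality
                rule, current, removed = candidate, new, True
                break                    # rescan using the new quality value
    return rule
```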

After rules are induced for all the classes, the rules can be used to classify 
new examples. The classification procedure in ELEM2 considers three possible 
cases when a new example matches a set of rules. 

— Single match. The new example satisfies one or more rules of the same class. 
In this case, the example is classified to the class indicated by the rule(s). 

— Multiple match. The new example satisfies more than one rule that indicates 
different classes. In this case, ELEM2 activates a conflict resolution scheme 
for the best decision. 

— No match. The new example is not covered by any rule. In this case, the 
decision score is calculated for each class and the new example is classified 
into the class with the highest decision score. 

5 Rule Visualization 

The basic idea of visualizing classification rules in the CViz system is to
represent a rule as a strip, called a rule polygon, which covers the area that
connects the corresponding attribute values. A classification rule induced by
the ELEM2 algorithm is a logical statement that consists of two parts: a
condition part and a decision part. The decision part is simply the class label.
The condition part is composed of a set of simple relations, each of which
corresponds to a specific attribute.

(1) For categorical attributes:

A = a1, a2, ..., or an, meaning that attribute A has any of the values
a1, a2, ..., or an, where n >= 1;

A != a1, a2, ..., or an, meaning that attribute A has any value except
a1, a2, ..., and an, where n >= 1.

(2) For continuous attributes:

A <= a, meaning that attribute A has any value between the minimum
and a;

a < A, meaning that attribute A has any value between a and the maximum;

a1 < A <= a2, meaning that attribute A has any value between a1 and a2.

For example, the following is a classification rule obtained in our experiment 
with an artificial data set, and is visualized in Fig. 

(30 < Age <= 60)(1.0 < Score <= 3.5)(Color = red or yellow) -> Class = bad.

It means that if numerical attributes Age and Score have values between 30 and
60 and between 1.0 and 3.5, respectively, and categorical attribute Color has
value red or yellow, then the instance will be in class bad. The corresponding
rule polygon consists of two polygons which are enclosed by the points that
correspond to values red and yellow on coordinate Color, respectively, the points
that correspond to 30 and 60 on coordinate Age, the points that correspond to 1.5
and 3.5 on coordinate Score, and the point corresponding to class bad on the
decision coordinate.
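
As a small illustration of how such a rule can be represented and tested against an instance, consider the sketch below. The dictionary layout and the function name are hypothetical; numeric conditions are half-open intervals (low, high], and categorical conditions are sets of allowed values, matching the relation forms listed above. The bounds are copied from the rule as displayed.

```python
def matches(rule, instance):
    """True if the instance satisfies every condition of the rule."""
    for attr, cond in rule["conditions"].items():
        value = instance[attr]
        if isinstance(cond, tuple):      # numeric interval (low, high]
            low, high = cond
            if not (low < value <= high):
                return False
        elif value not in cond:          # categorical: set of allowed values
            return False
    return True

rule = {
    "conditions": {"Age": (30, 60), "Score": (1.0, 3.5), "Color": {"red", "yellow"}},
    "decision": "bad",
}
```

Each entry of `conditions` corresponds to one coordinate that the rule polygon passes through, and a set with two values (red or yellow) corresponds to the two polygons mentioned above.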

The rule polygon is constructed by a rule polygon generating procedure, which 
is described in Fig. 4.

To distinguish between positive conditions and negative conditions, which are
represented with the inequality symbol (!=), a positive condition is drawn with
a backward hatch, while a negative condition is drawn with a forward hatch. In
addition, the rule accuracy and the normalized rule quality are used to
calculate the color of the rule polygon. The more accurate the rule, the redder
the polygon, while the lower the rule quality, the brighter the polygon.

Procedure for drawing rule polygons

preCond = first condition;
repeat
    postCond = next condition;
    for each or-component of preCond do
        if preCond corresponds to a categorical attribute
            then compute the points p0, p1 according to the specific categorical value;
            else compute the points p0, p1 according to a numerical interval;
        for each or-component of postCond do
            if postCond corresponds to a categorical attribute
                then compute the points p2, p3 according to the specific categorical value;
                else compute the points p2, p3 according to a numerical interval;
            draw a small polygon with four points p0, p1, p2, and p3;
    preCond = postCond;
until all conditions are processed;
draw the small polygon from the last condition to the decision attribute.

Fig. 4. Procedure for drawing rule polygons

6 CViz Implementation and Experiment

The CViz system has been implemented in Visual C++ 6.0. The data preparation
is accomplished by choosing a data file and an attributes file which describes
the attributes, including attribute name, type, length, position in the tuple,
domain, etc. This procedure is implemented in dialog windows (under the file
menu and setting menu). The steps of visualizing and discovering knowledge are
controlled by the control menu, which consists of the five steps.

CViz has been tested on several data sets from the UCI Repository of Machine
Learning Databases, including IRIS and Monk's, and also on artificial data
sets. An artificial data set is designed to involve two numerical attributes,
Age and Score, one categorical condition attribute, Color, and one categorical
decision attribute, Class. Age is of integer type and has range [0, 90]. Score
is real-valued and has range [0.0, 5.0]. Color has four discrete values: red,
blue, yellow, and green. The decision attribute has two class labels: good and
bad. The relationship between the class labels and the condition attributes in
the data set is designed as follows:

IF (30<Age<=60) and (1.5<Score<=3.5) and (Color=red or yellow)
THEN Class=bad
ELSE Class=good
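
A data set with this design can be generated with a few lines of Python. The sketch below is only an assumption about how such data might be produced: the uniform sampling, the one-decimal rounding of Score, the seed, and the function names are hypothetical, while the attribute ranges and the labelling rule follow the description above.

```python
import random

COLORS = ["red", "blue", "yellow", "green"]

def label(age, score, color):
    """Class label according to the designed relationship."""
    if 30 < age <= 60 and 1.5 < score <= 3.5 and color in ("red", "yellow"):
        return "bad"
    return "good"

def make_dataset(n, seed=0):
    """Generate n tuples (Age, Score, Color, Class); sampling is assumed uniform."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(0, 90)                   # integer in [0, 90]
        score = round(rng.uniform(0.0, 5.0), 1)    # real value in [0.0, 5.0]
        color = rng.choice(COLORS)
        rows.append((age, score, color, label(age, score, color)))
    return rows
```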

Figs. 5 through 9 illustrate the original data visualization, discretized data
visualization, all final rules visualization, rules for class bad, and rules for
class good, respectively. If many rules are obtained and visualized on the same
display, it is hard to distinguish them. CViz allows the user to specify a class
label to view all rules that have this class label as the decision value. Fig. 8
and Fig. 9 are decomposed from Fig. 7.

From Figs. 5 through 9, we can see that one rule is obtained for class bad,
which is the same as the one above, and three rules are obtained for class good,
which correspond to three conditions:

Age<=30 or Age>60, 

Score<=1.5 or Score>3.5, 

Color != (red or yellow), 



respectively.

Fig. 5. The original data visualized on the parallel coordinates




Fig. 6. Discretization of continuous attributes with EDA-DB method 



Fig. 10 and Fig. 11 show the final results that ELEM2 learns from the Monks-1
and IRIS data sets [12], which involve 9 and 8 classification rules, respectively.
In Fig. 10, five rules are obtained for class 0, while there are four rules for
class 1. In Fig. 11, we have four rules for class 3, three rules for class 2, and
one rule for class 1. Fig. 12 illustrates all rules for class 2 discovered from
the IRIS data set; the rules for class 1 and class 3 discovered from the IRIS
data set have a similar display form.

Fig. 7. All rules discovered from the artificial data set

Fig. 9. The rules for class good

Fig. 10. The rules discovered from the Monks-1 data set

Fig. 11. The rules discovered from the IRIS data

Fig. 12. The rules for class 2 discovered from the IRIS data

7 Related Works

Keim and Kriegel compare the different techniques for visualizing information 
m and divide these techniques into eight categories: geometric techniques, icon- 
based techniques, pixel-oriented techniques, hierarchical techniques, graph-based 
techniques, 3D techniques, dynamic techniques, and hybrid techniques. Fukuda 
et al. proposed the SONAR system which discovers association rules from two 
dimensional data by visualizing the original data and finding an optimized rect- 
angular or admissible region [Hj . Han and Cercone implemented the DViz system 
for visualizing various kinds of knowledge [H] and the AViz system for visualizing 
three dimensional numerical association rules [O]. Silicon Graphics developed a 
series of visualizers [7] to visualize different knowledge, like decision trees, neural
nets, etc. There have been many well-known visualization systems like VisDB 
m, Spotfire [I], Visage jl], KnowledgeSeeker, DataMind [Zj, etc. but none of 
them implements visualization of the entire process of knowledge discovery. 

8 Concluding Remarks 

CViz is an interactive system for visualizing and learning classification rules. It 
can be used to visualize the original data on a parallel coordinates system. The 
user can interactively reduce the data set horizontally and vertically by remov- 
ing irrelevant attributes and/or attribute values. The user can also interactively 
select his/her favorite approaches to discretizing numerical attributes. The dis- 
cretized attributes are treated as categorical ones and each interval corresponds 
to a discrete value. The ELEM2 induction algorithm is used to learn classifica- 
tion rules which are displayed in visual forms. The user can interactively choose 
a class to view the corresponding rules. Classification rules may have complex 
logical form. In our implementation, we emphasize the human-machine interac- 
tion, since we believe that interactive visualization plays an important role in 
the process of discovering knowledge. Our experimental results have also
demonstrated that CViz helps users understand the relationships among data
and concentrate on the meaningful data to discover knowledge. The capability
of CViz will be expanded so that the user can interactively specify the rule
accuracy threshold and/or the rule quality threshold, which might be used to 
limit the search space for final rules. 

It would be easy to adapt CViz to other rule-based learning systems, because
the CViz system simply incorporates the learning algorithm as one step of
the visualization process, and rule-based learning systems differ mainly in the
learning algorithm they use. It remains a research topic, however, to adapt
CViz to learning systems other than rule-based ones, because CViz is based on
the parallel coordinates technique, which may not be suitable
for representing decision trees and neural networks.

Abstract. Most algorithms to learn belief networks use single-link lookahead
search to be efficient. It has been shown that such search procedures
are problematic when applied to learning pseudo-independent (PI) models.
Furthermore, some researchers have questioned whether PI models
exist in practice.

We present two non-trivial PI models which derive from a social study
dataset. For one of them, the learned PI model reached the ultimate prediction
accuracy achievable given the data only, while using slightly more
inference time than the learned non-PI model. These models provide
evidence that PI models are not simply mathematical constructs.

To develop efficient algorithms to learn PI models effectively, we benefit
from studying and understanding such models in depth. We further analyze
how multiple PI submodels may interact in a larger domain model.
Using this result, we show that the RML algorithm for learning PI models
can learn more complex PI models than previously known.

Keywords: data mining, learning, uncertainty, belief networks.



1 Introduction 

Learning belief networks [12] from data, as an alternative or enhancement to
elicitation from experts, has been an active research area in recent years, e.g.,
[6], [5]. As the task is NP-hard, a common search method
used in heuristic learning is the single-link lookahead, where successive graphical
structures adopted differ by a single link. It has been shown that a class of
probabilistic models called pseudo-independent (PI) models cannot be learned
by single-link search. A more sophisticated method (multi-link lookahead)
has been proposed and is improved in [8] for learning decomposable Markov
networks (DMNs) from data.

DMNs are less expressive than Bayesian networks (BNs). However, DMNs are 
the runtime representation of some algorithms for inference with BNs [mi2][i3], 
and can be the intermediate results for learning BNs. For example, learning PI 
models needs multi-link lookahead and the search space for DAGs is much larger 
than that of chordal graphs. Learning DMNs first can then restrict the search 
for DAGs to a much smaller space, improving the efficiency. 



H. Hamilton and Q. Yang (Eds.): Canadian AI2000, LNAI 1822, pp. 227 42891 2000.

One question often raised is whether PI models arise in practice. It has been
shown that parity and modulus addition are special cases of PI models,
although some consider their occurrences to be less than ‘real’. In this paper, 
we present two PI models discovered from ‘real’ data. The experimental results 
provide evidence that PI models do occur in practice. 

A better understanding of PI models facilitates developing algorithms that 
learn PI models effectively. We provide an analysis of how multiple PI submodels 
are embedded and interact in a model. Based on this analysis, we show that the 
algorithm RML [8] can learn more complex PI models than previously known. 

We review the background on PI models in Section 2 and analyze the interaction
of PI submodels in Section 3. We review the RML algorithm in Section 4
and then present our new result on its learning power in Section 5. Our
experimental discovery of PI models is presented in Section 6.

2 Overview of PI Models 

Let $N$ be a set of discrete variables $\{X_1, \ldots, X_n\}$ in a problem domain.
Each variable is associated with a finite number of possible values, which we
shall denote by consecutive integers $0, 1, 2, \ldots$. A configuration of
$N' \subseteq N$ is an assignment of values to every variable in $N'$, e.g.,
$(X_1 = 0, X_2 = 1, \ldots)$.

Let $P(X_i)$ represent the probability function for $X_i$ and $P(x_i)$ denote
the probability value $P(X_i = x_i)$. The joint probability distribution (jpd)
is $P(N) = P(X_1, X_2, \ldots, X_n)$, and $P(x_{1,0}, x_{2,1}, \ldots, x_{n,0})$
denotes the probability of a particular tuple of $N$. A probabilistic domain
model (PDM) $M$ over $N$ determines the probability of every tuple of $N'$ for
each $N' \subseteq N$.

For three disjoint subsets $A$, $B$ and $C$, $A$ and $B$ are conditionally
independent given $C$, denoted $I(A, C, B)_M$, if $P(A|B, C) = P(A|C)$ whenever
$P(B, C) > 0$. When $C = \emptyset$, $A$ and $B$ are marginally independent. If
each variable $X$ in $A$ is marginally independent of $A \setminus \{X\}$, then
$P(A) = \prod_{X \in A} P(X)$, and we shall say that variables in $A$ are
marginally independent. Variables in $A$ are collectively dependent if for each
proper subset $B \subset A$, there exists no proper subset
$C \subset A \setminus B$ such that $P(B | A \setminus B) = P(B|C)$. Variables
in $A$ are generally dependent if for any proper subset $B$,
$P(B | A \setminus B) \neq P(B)$. We introduce the concept of marginally
independent subsets, to be used in the definition of PI models below.

Definition 1 (Marginally independent subsets) Let $N$ be a set of variables.
Two disjoint nonempty subsets $N_1$ and $N_2$ of $N$ are marginally
independent subsets if for each $X \in N_1$ and $Y \in N_2$, $X$ and $Y$ are
marginally independent.

A domain may be partitioned into marginally independent subsets.

Definition 2 (Marginally independent partition) Let $M$ be a PDM over
$N$. A partition $\{N_1, \ldots, N_k\}$ ($k > 1$) of $N$ is a marginally
independent partition if every two subsets $N_i$ and $N_j$
($1 \le i, j \le k$, $i \ne j$) are marginally independent subsets.



Learning Pseudo-independent Models: Analytical and Experimental Results 229 



We refer to each $N_i$ as an element of the partition.

Let $A$, $B$ and $C$ be disjoint subsets of nodes in an undirected graph
$G = (N, E)$. $C$ is said to separate $A$ from $B$, denoted
$\langle A | C | B \rangle_G$, if every path from $A$ to $B$ has a node in $C$.
Given a PDM $M$ over $N$ and a graph $G = (N, E)$, $G$ is an I-map of $M$ if
for all disjoint $A$, $B$, $C$, we have
$\langle A | C | B \rangle_G \Rightarrow I(A, C, B)_M$. A minimal I-map is one
in which no link can be deleted such that it is still an I-map.

A pseudo-independent (PI) model is a PDM where proper subsets of a set of
collectively dependent variables display marginal independence. PI models
can be classified into three types. The most restrictive type is the full PI model.

Definition 3 (Full PI model) A PDM over a set $N$ ($|N| \geq 3$) of variables is
a full PI model if (S1) for each $X \in N$, variables in $N \setminus \{X\}$ are marginally
independent; and (S2) variables in $N$ are collectively dependent.

In a full PI model, every proper subset of variables is marginally independent.
This is relaxed in partial PI models, in which not every proper subset of
variables is marginally independent.

Definition 4 (Partial PI model) A PDM over a set $N$ ($|N| \geq 3$) of variables
is a partial PI model if (S1') $N$ forms a marginally independent partition
$\{N_1, \ldots, N_k\}$ ($k > 1$); and (S2) variables in $N$ are collectively dependent.

In a PI model, it may be the case that not all variables in the domain are
collectively dependent. An embedded PI submodel displays the same dependence
pattern as the previous PI models but involves only a proper subset of the domain
variables.

Definition 5 (Embedded PI submodel) Let a PDM be over a set $N$ of generally
dependent variables. A proper subset $N' \subset N$ ($|N'| \geq 3$) of variables forms
an embedded PI submodel if (S3) $N'$ forms a partial PI model; and (S4) the
partition $\{N_1, \ldots, N_k\}$ of $N'$ by (S1') extends into $N$. That is, there is a marginally
independent partition $\{A_1, \ldots, A_k\}$ of $N$ such that $N_i \subseteq A_i$ ($i = 1, \ldots, k$).

In general, a PI model can contain one or more PI submodels, and this 
embedding can occur recursively for any finite number of times. Since variables 
in a PI submodel are collectively dependent, in a minimal I-map, the variables 
in the submodel are completely connected. The marginal independence between 
subsets in the submodel is thus unrepresented. The undirected I-maps can be 
extended into colored I-maps: 

Definition 6 An undirected graph G is a colored I-map of a PDM M over 
N if (1) G is a minimal I-map of M, and (2) for each PI submodel m, links 
between each pair of nodes from distinct marginally independent subsets in m 
are colored. Other links are referred to as black.

Conditional independence relations among variables can be read off a colored
I-map by treating it as a normal I-map while recognizing marginal independence 
between variables connected by colored links. 

A PI model is shown in Table 1. It contains four PI submodels over

$N_1 = \{a, c, d\}$, $N_2 = \{a, b, c\}$, $N_3 = \{b, c, d\}$, $N = \{a, b, c, d\}$.

The entire domain $N$ forms a partial PI model with the other three PI submodels
embedded. Figure 1 shows the colored I-map, where colored links are dotted.



Table 1. A partial PI model with embedded PI submodels.

(d,a,b,c)  P(.)    (d,a,b,c)  P(.)    (d,a,b,c)  P(.)    (d,a,b,c)  P(.)
(0,0,0,0)  0.02    (0,1,0,0)  0.1     (1,0,0,0)  0.03    (1,1,0,0)  0.09
(0,0,0,1)  0.02    (0,1,0,1)  0.06    (1,0,0,1)  0.01    (1,1,0,1)  0.07
(0,0,1,0)  0.06    (0,1,1,0)  0.14    (1,0,1,0)  0.01    (1,1,1,0)  0.15
(0,0,1,1)  0       (0,1,1,1)  0.1     (1,0,1,1)  0.05    (1,1,1,1)  0.09



Fig. 1. Colored I-map of the model in Table 1.
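
The pairwise marginal independences behind the colored links can be checked directly from the jpd. The following is a minimal sketch, assuming the sixteen entries of Table 1 are keyed by (d, a, b, c); the helper names `marginal` and `marginally_independent` and the tolerance `tol` are illustrative choices, not part of the original study.

```python
from itertools import combinations

# Table 1, keyed by the tuple (d, a, b, c).
VARS = ("d", "a", "b", "c")
JPD = {
    (0, 0, 0, 0): 0.02, (0, 1, 0, 0): 0.10, (1, 0, 0, 0): 0.03, (1, 1, 0, 0): 0.09,
    (0, 0, 0, 1): 0.02, (0, 1, 0, 1): 0.06, (1, 0, 0, 1): 0.01, (1, 1, 0, 1): 0.07,
    (0, 0, 1, 0): 0.06, (0, 1, 1, 0): 0.14, (1, 0, 1, 0): 0.01, (1, 1, 1, 0): 0.15,
    (0, 0, 1, 1): 0.00, (0, 1, 1, 1): 0.10, (1, 0, 1, 1): 0.05, (1, 1, 1, 1): 0.09,
}


def marginal(jpd, idxs):
    """Sum the jpd down to the variables at positions idxs."""
    out = {}
    for tup, p in jpd.items():
        key = tuple(tup[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out


def marginally_independent(jpd, i, j, tol=1e-9):
    """True if P(Xi, Xj) = P(Xi) P(Xj) for every value pair (up to tol)."""
    pi, pj, pij = marginal(jpd, [i]), marginal(jpd, [j]), marginal(jpd, [i, j])
    return all(abs(p - pi[(x,)] * pj[(y,)]) <= tol for (x, y), p in pij.items())


# Pairs of variables that are NOT marginally independent.
dependent_pairs = [
    (VARS[i], VARS[j])
    for i, j in combinations(range(len(VARS)), 2)
    if not marginally_independent(JPD, i, j)
]
print(dependent_pairs)
```

Under this reading of the table, only the pair (d, c) fails the product test, which matches the single black link between c and d; every other pair is marginally independent and therefore joined by a colored link.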



3 How PI Submodels Interact? 

A PI model may contain a number of PI submodels. How are these submodels
related to each other? We address this question below. An understanding of their 
interaction will guide us in designing better learning algorithms and evaluating 
the quality of learning outcomes. 

First, we refine the concept of marginally independent partition. Given a 
PDM, it may have multiple marginally independent partitions. We identify the 
‘finest’ partition as follows: 

Definition 7 (Minimum partition) Let $M$ be a PDM over $N$. A marginally
independent partition $\{N_1, \ldots, N_k\}$ ($k > 1$) of $N$ is minimum if no $N_i$ ($1 \leq i \leq k$)
can be partitioned into marginally independent subsets.

For instance, in the PDM of Table 1, $\{\{a\}, \{c, d\}, \{b\}\}$ is a minimum partition.
A minimum partition is unique as shown below:

Proposition 8 Any PDM has a unique minimum marginally independent
partition.
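
One way to see why the minimum partition is unique under Definitions 1 and 2 is that it coincides with the connected components of the graph that links every pair of marginally dependent variables. The sketch below illustrates this on Table 1; the function names are hypothetical, and the union-find grouping is an illustrative device rather than the procedure used in the paper. When the result is a single block, the PDM has no marginally independent partition in the sense of Definition 2, which requires $k > 1$.

```python
from itertools import combinations

# Table 1, keyed by the tuple (d, a, b, c).
VARS = ("d", "a", "b", "c")
JPD = {
    (0, 0, 0, 0): 0.02, (0, 1, 0, 0): 0.10, (1, 0, 0, 0): 0.03, (1, 1, 0, 0): 0.09,
    (0, 0, 0, 1): 0.02, (0, 1, 0, 1): 0.06, (1, 0, 0, 1): 0.01, (1, 1, 0, 1): 0.07,
    (0, 0, 1, 0): 0.06, (0, 1, 1, 0): 0.14, (1, 0, 1, 0): 0.01, (1, 1, 1, 0): 0.15,
    (0, 0, 1, 1): 0.00, (0, 1, 1, 1): 0.10, (1, 0, 1, 1): 0.05, (1, 1, 1, 1): 0.09,
}


def marginal(jpd, idxs):
    """Sum the jpd down to the variables at positions idxs."""
    out = {}
    for tup, p in jpd.items():
        key = tuple(tup[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out


def marginally_independent(jpd, i, j, tol=1e-9):
    """True if P(Xi, Xj) = P(Xi) P(Xj) for every value pair (up to tol)."""
    pi, pj, pij = marginal(jpd, [i]), marginal(jpd, [j]), marginal(jpd, [i, j])
    return all(abs(p - pi[(x,)] * pj[(y,)]) <= tol for (x, y), p in pij.items())


def minimum_partition(jpd, names):
    """Group variables into connected components of the pairwise-dependence graph."""
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(names)), 2):
        if not marginally_independent(jpd, i, j):
            parent[find(i)] = find(j)  # dependent variables share a block

    blocks = {}
    for i, name in enumerate(names):
        blocks.setdefault(find(i), set()).add(name)
    return sorted(blocks.values(), key=sorted)


partition = minimum_partition(JPD, VARS)
print(partition)
```

For Table 1 this yields the three blocks {a}, {b} and {c, d}, in agreement with the minimum partition stated above.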
Next, we define covered and uncovered colored links in a PI submodel to
describe the relation between PI submodels which share variables. It turns out
that this relation has a lot to do with how each submodel can be learned and in
what order, as will be seen.



Definition 9 (Uncovered colored link) Let $G$ be a colored I-map of a PI model
$M$. Let $l$ be a colored link in a PI submodel $m$ which contains $k_m$ colored links.
The link $l$ is uncovered in $m$ if there exists no PI submodel $s$ in $M$ such that $l$
is also contained (covered) in $s$ and the number of colored links $k_s$ of $s$ satisfies
$k_s < k_m$.

Figure 2(a) illustrates PI submodels with both covered and uncovered colored
links. Figure 2(b) illustrates PI submodels which share variables but whose
colored links are all uncovered. Table 1 is a numerical example for the case of
Figure 2(a), where $m$ is over $\{a, b, c\}$ and $s$ is over $\{b, c, d\}$. A numerical example
for the case of Figure 2(b) is given in Table 2, where $m'$ is over $\{a, b, c\}$ and $s'$
is over $\{b, c, d\}$.




Fig. 2. (a) Two PI submodels m and s (each enclosed in an oval) share variables
{c, d, e}. Assume no other PI submodels share variables with them. Submodel
m has nine colored links. Six of them, (a, c), (a, d), (a, e), (b, c), (b, d), (b, e), are
uncovered. The remaining three are covered. Submodel s has six colored links, all of
which are uncovered. (b) Two PI submodels m' and s' share variables {b, c, d}.
Submodel m' contains six colored links and all of them are uncovered. Submodel
s' has the same number of colored links and all of them are uncovered.
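
The counting in Figure 2(a) can be reproduced with a few lines of code. This is a sketch under stated assumptions: the figure itself is not available here, so the marginally independent partitions of m and s are inferred from the caption (a and b are taken to share a subset in m, and s is given a hypothetical fourth variable f so that it has six colored links). The functions `colored_links` and `uncovered_links` are illustrative names implementing Definition 9.

```python
from itertools import combinations


def colored_links(partition):
    """Links between variables drawn from distinct marginally independent subsets."""
    links = set()
    for s1, s2 in combinations(partition, 2):
        for x in s1:
            for y in s2:
                links.add(frozenset((x, y)))
    return links


def uncovered_links(name, submodels):
    """Definition 9: a link of `name` is covered if a submodel with strictly
    fewer colored links also contains it; the rest are uncovered."""
    own = submodels[name]
    covered = set()
    for other, links in submodels.items():
        if other != name and len(links) < len(own):
            covered |= own & links
    return own - covered


# Assumed partitions (inferred from the caption; the variable f is hypothetical).
submodels = {
    "m": colored_links([{"a", "b"}, {"c"}, {"d"}, {"e"}]),  # nine colored links
    "s": colored_links([{"c"}, {"d"}, {"e"}, {"f"}]),       # six colored links
}

uncovered_m = uncovered_links("m", submodels)
uncovered_s = uncovered_links("s", submodels)
print(sorted(tuple(sorted(l)) for l in uncovered_m))
print(len(uncovered_s))
```

With these assumed partitions, m keeps the six uncovered links listed in the caption, while the three links among c, d and e are covered by the smaller submodel s; all six links of s remain uncovered because m has more colored links than s.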



Table 2. A model with two PI submodels

(d,a,b,c)  P(.)        (d,a,b,c)  P(.)        (d,a,b,c)  P(.)        (d,a,b,c)  P(.)
(0,0,0,0)  0.0024      (0,1,0,0)  0.00336     (1,0,0,0)  0.00336     (1,1,0,0)  0.004704
(0,0,0,1)  0.000064    (0,1,0,1)  0.000448    (1,0,0,1)  0.000448    (1,1,0,1)  0.003136
(0,0,1,0)  0.002304    (0,1,1,0)  0.008064    (1,0,1,0)  0.008064    (1,1,1,0)  0.028224
(0,0,1,1)  0.0024      (0,1,1,1)  0.00336     (1,0,1,1)  0.00336     (1,1,1,1)  0.004704

Next, we analyze several properties of PI models related to the concepts
introduced above. Lemma 10 says that the number of colored links in any PI
model is lower bounded by 2.

Lemma 10 Let $M$ be a PI model and $k$ be the number of colored links in $M$.
Then $k \geq 2$.

Lemma 11 reveals the features of PI models when the lower bound in
Lemma 10 is reached.

Lemma 11 Let $M$ be a partial PI model over $N$ with exactly two colored links.
Then (1) the minimum marginally independent partition Par of $N$ has two
elements and (2) $|N| = 3$.

Lemma 12 says that the upper bound on the number of uncovered colored links in
PI submodels is itself lower bounded by 2.

Lemma 12 Let $M$ be a PI model in which each embedded PI submodel contains
no more than $i$ uncovered colored links. Then $i \geq 2$.

Theorem 13 reveals how PI submodels in a PI model are related to each
other.

Theorem 13 Let PDM $M$ be a PI model and $\{M_1, \ldots, M_j\}$ ($j > 1$) be the PI
submodels in $M$. Let $D$ be a directed graph of $j$ nodes where each node is labeled
by an $M_i$ such that $M_j$ is a parent of $M_i$ if $M_j$ has colored links covered by $M_i$.
Then $D$ is acyclic.

We shall call $D$ the PI submodel coverage DAG of $M$.
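
One informal way to see why $D$ cannot contain a cycle follows directly from Definition 9. The notation $k(\cdot)$ for the number of colored links of a submodel is introduced here only for this sketch, and the argument is an illustration rather than the proof given by the authors.

```latex
% Sketch only: k(M_x) denotes the number of colored links in submodel M_x (assumed notation).
\begin{align*}
M_x \to M_y \text{ in } D
  &\;\Longrightarrow\; k(M_y) < k(M_x)
  && \text{(a covering submodel has strictly fewer colored links)} \\
M_{x_1} \to M_{x_2} \to \cdots \to M_{x_r} \to M_{x_1}
  &\;\Longrightarrow\; k(M_{x_1}) > k(M_{x_2}) > \cdots > k(M_{x_r}) > k(M_{x_1})
  && \text{(a contradiction)}
\end{align*}
```

Every arc therefore points from a submodel with more colored links to one with strictly fewer, so following arcs can never return to the starting node.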

D may be disconnected. If two submodels neither share colored links directly 
nor share colored links through a chain of intermediate submodels, then the two 
submodels will be disconnected. In a minimal colored I-map, the two submodels 
will be connected through black links. 

Another case of disconnection is when two submodels share colored links only 
with each other and have the identical number of colored links. Since none can 
cover the links of the other, there cannot be a directed link between them in D. 
Each PI submodel with all its colored links uncovered is a leaf in D. 

For example, in the PDM shown in Table 1 there are four PI submodels
$M_1$, $M_2$, $M_3$ and $M_4$ over $N_1$, $N_2$, $N_3$ and $N$, respectively. Its PI submodel
coverage DAG has a root node $M_4$ with three child nodes $M_1$, $M_2$ and $M_3$.

4 Overview of RML Algorithm 

PI models cause difficulty for common algorithms that are based on a single-link
lookahead search. The initial attack on learning PI models is based
on iterations of lookahead(i) as shown in Algorithm 1. It is intended to learn
a decomposable Markov network (DMN) from a dataset over a set N of
variables. The K-L cross entropy is used as the score metric. The algorithm
consists of a sequence of calls of lookahead(i) with i taking the values 1, 2, ...,
k for a specified k value.

Algorithm 1 boolean lookahead(i);

Parameter i: number of lookahead links.

Input: graph G = (N, E) and threshold δh.

begin
  modified := false, G' := G;
  repeat
    initialize entropy decrement dh' := 0;
    for each set L of links (|L| = i, L ∩ E = ∅), do
      if G* = (N, E ∪ L) is chordal, then
        compute entropy decrement dh*;
        if dh* > dh', then dh' := dh*, G' := G*;
    if dh' > δh, then
      G := G', done := false, modified := true;
    else done := true;
  until done = true;
  return modified;
end



The lookahead(i) performs a multi-link lookahead search which examines i 
link(s) at each step. That is, alternative structures that differ from the current 
structure by i links are evaluated. The i links that decrease the entropy maxi- 
mally are selected. If the corresponding entropy decrement is significant enough, 
the i links will be adopted and the search continues until no more links can be 
learned. We refer to this search as an i-link-only search. 

The algorithm based on i-link-only search with incrementally larger i values
can learn many PI models correctly. However, when a PI model contains
recursively embedded PI submodels, the algorithm fails. For example, if the data is
populated by the PI model in Table 1, then after learning the black link in the
single-link search and submodels $M_1$ and $M_3$ in the double-link-only search, the
algorithm will halt, missing the colored link {a, b}.

Algorithm 2 RML

Input: data over a set N of variables, a maximum number k of lookahead links.
begin
1  initialize a graph G = (N, E = ∅);
2  for j := 1 to k, do
3    i := 1;
4    while i ≤ j, do
5      modified := lookahead(i);
6      if i > 1 AND modified = true, then i := 1;
7      else i := i + 1;
8  return G.
end



The RML algorithm (shown in Algorithm 2) was proposed to improve the
performance.

RML also uses lookahead(i). However, whenever links are learned in an
i-link-only search with i > 1, RML backtracks to single-link search, as shown in
line 6. This allows RML to learn recursively embedded PI submodels correctly.
For the PI model in Table 1, the link {a, b} will be learned when backtracking to
the single-link search after the double-link-only search. The complexity of RML
was analyzed in earlier work.
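
The backtracking control flow of RML can be sketched independently of the scoring details. In the sketch below, `lookahead` is passed in as a callable that returns whether any links were learned; `fake_lookahead` and the `pending` table are hypothetical stand-ins used only to trace the order of calls, and the loop bounds (i starting at 1 and running while i ≤ j) are an assumed reading of Algorithm 2.

```python
from typing import Callable, Dict, List


def rml(lookahead: Callable[[int], bool], k: int) -> None:
    """Control loop of RML: run i-link-only searches for i = 1..j, and fall
    back to single-link search whenever a multi-link search learns links."""
    for j in range(1, k + 1):
        i = 1
        while i <= j:
            modified = lookahead(i)
            if i > 1 and modified:
                i = 1          # backtrack to single-link search (line 6)
            else:
                i += 1         # move on to the next lookahead size (line 7)


# Hypothetical stand-in: the double-link search learns links exactly once.
calls: List[int] = []
pending: Dict[int, int] = {2: 1}


def fake_lookahead(i: int) -> bool:
    calls.append(i)
    if pending.get(i, 0) > 0:
        pending[i] -= 1
        return True
    return False


rml(fake_lookahead, k=2)
print(calls)
```

With this stand-in the recorded call sequence is 1, 1, 2, 1, 2: after the successful double-link search the loop returns to a single-link search before trying double-link search again, which is the step that lets a link such as {a, b} in Table 1 be picked up.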

5 Models Learnable by RML 

The PI models that are learnable by RML were analyzed in earlier work. It was
concluded that if a PI model has no submodel that contains more than k colored
links, then RML with the parameter k can learn the model correctly. In the following,
we refine that result. The new result shows that RML with the parameter k can
actually learn PI submodels with more than k colored links. This expands the
known learning power of RML.

We first introduce some background. We define two properties that are sat- 
isfied by some PDMs: 

Definition 14 Let $A$, $B$, $C$, $V$ and $W$ be disjoint subsets of variables.
Composition: $I(A, B, C) \;\&\; I(A, B, W) \Rightarrow I(A, B, C \cup W)$.

Strong Transitivity: $I(A, B \cup V, C) \;\&\; I(B, C \cup V, W) \Rightarrow I(A, B \cup V, C \cup W)$.

We shall use the following result from earlier work, which characterizes the
learning capacity of a lookahead(1)-like single-link lookahead for learning DMNs^1. For
the purpose of analysis, we assume perfect data (no sampling error) and the
threshold δh is then set to zero.

Theorem 15 Let $M$ be a PDM that satisfies composition and strong
transitivity. Let $G$ be a chordal graph returned by a lookahead(1)-like single-link
lookahead search. Then $G$ is an I-map of $M$.

Theorem 16 shows the learning capacity of RML.

Theorem 16 Let PDM $M$ be a PI model over $N$. Let Par $= \{N_1, \ldots, N_j\}$
($j > 1$) be the minimum marginally independent partition of $N$. If variables
in each $N_i$ satisfy composition and strong transitivity, and each embedded PI
submodel contains no more than $k$ ($k \geq 2$) uncovered colored links, then RML
with the parameter $k$ will return an I-map of $M$.

Theorem 16 shows that RML with parameter k is not limited to learning
PI submodels with up to k colored links. Rather, it is limited to learning PI
submodels with up to k uncovered colored links. Hence, a PI submodel with more
than k colored links is learnable by RML as long as it shares colored links with
other PI submodels so that it has no more than k uncovered links. Since the
time complexity of RML is exponential in k, this result implies that much more
complex PI submodels (compared with the previous result) can be learned
correctly without increasing the computational complexity.

^1 There are some minor differences between the single-link lookahead used in the
cited work and lookahead(1), e.g., the maximum score improvement at each search step
is not required there. However, the difference is irrelevant to the current result.

6 Discovery of PI Models from Data

Since previous reports on the study of PI models have used constructed models
like those in Tables 1 and 2, the practical value of such study has been questioned
by some. In the following, we provide two PI models that were discovered from
real data in a preliminary experimental study.

The data we used is from the 1993 General Social Survey (GSS) on Personal
Risk conducted by Statistics Canada in 1993 [2]. The dataset contains 11960
cases over 469 variables. A preliminary study was performed on a few subjects.
The experiment was carried out using the WEBWEAVR-III toolkit (available for
downloading at the first author’s homepage), which implements the RML
algorithm as one module.



Table 3. Variables in data on Accident Prevention

index  Variable     Question
0      UseSeatBelt  Accident Protection: Use seat belt in vehicle?
1      WearHelmet   Accident Protection: Wear helmet riding bicycle?
2      MedFrmKid    Accident Protection: Store medicines from children?
3      SafetyEquip  Accident Protection: Use safety equipment?
4      SmokeAlarm   Do you have a working smoke detector in your home?
5      FireExtsher  Do you have a working fire extinguisher at home?
6      FstAidSuply  Do you have first aid supplies at home?
7      FstAidTrain  You or household members trained in first aid?



One PI model we discovered is on “Accident Prevention Precautions” (Table 3).
The first eight variables (questions) in the data were used in the study.
All of them are binary. After deleting cases with missing values, 4303 cases
were used as the learning input. Using k = 2, the learning program returned the
DMN in Figure 3(d).

The learning process is shown in Figure 3. The first single-link lookahead
search learned a disconnected graph shown in (a). Note that three marginally
independent subsets were found. In the following double-link-only search, the
three PI submodels below were learned and are shown in (b), (c) and (d),
respectively.

M_1 : {SafetyEquip, FireExtsher, FstAidTrain}

M_2 : {WearHelmet, MedFrmKid, SafetyEquip}

M_3 : {SafetyEquip, FireExtsher, FstAidSuply, FstAidTrain}

Note that M_1 is recursively embedded in the PI submodel M_3. After the
double-link-only search, backtracking occurred without learning additional links.

Fig. 3. The process of learning the Accident Prevention model. Colored links are
shown as dotted.

Another PI model we discovered is on “Harmful Effects of Personal Drinking”.
The data contains 8 variables (questions) described in Table 4. The first
six variables are binary. The last two variables each have the domain {NoDrinking,
1To2Drinks, EnoughToFeelTheEffects, GettingDrunkIsSometimesOk}. After
deleting cases with missing values, 8047 cases were selected. The first 7047
cases were used as the learning input, and the other 1000 cases were held out as a
test set (see below).

Using k = 1, the learning program returned the DMN in Figure 4(a) with
two marginally independent subsets. Using k = 2, a partial PI submodel M =
{HarmHealth, HarmFinance, NmNonDrvrDrink} was detected. The DMN
returned is shown in (b), where colored links are shown as broken lines.



Fig. 4. DMNs learned from data on Harmful drinking: (a) using k = 1; (b) using
k = 2, with colored links shown as broken lines.



Table 4. Variables in data on Harmful drinking

i  Variable        Question
0  HarmSocial      Did alcohol harm friendships / social life?
1  HarmHealth      Did alcohol harm your physical health?
2  HrmLifOutlk     Did alcohol harm your outlook on life?
3  HarmLifMrig     Did alcohol harm your life or marriage?
4  HarmWorkSty     Did alcohol harm your work, studies, etc?
5  HarmFinance     Did alcohol harm your financial position?
6  NumDrivrDrink   How many drinks should a designated driver have?
7  NmNonDrvrDrink  How many drinks should a non-designated driver have?

A PI model captures more dependence in the data and hence will provide better
prediction when used for future decision making. On the other hand, it is also
more expensive to learn and to perform inference with. A good model is one that
provides sufficiently better prediction without being too much more expensive
in inference. To evaluate the overall “goodness” of the learned PI model, we
compared the performance of three learned models:

1. The DMN in Figure 4(a), which we refer to as the Non-PI DMN. To be able to
reason with it in a normal inference engine, we added a dummy link between
HarmFinance and NumDrivrDrink to make it connected. However, the
link potential table carries no dependence.

2. The DMN in Figure 4(b), which we refer to as the PI DMN.

3. A completely connected DMN, which we refer to as the jpd DMN.

Since the Non-PI DMN is a subgraph of the PI DMN, which in turn is a subgraph
of the jpd DMN, we expect that they provide increasingly better prediction
and that inference becomes increasingly more expensive. We tested the three DMNs
using the other 1000 cases, which the learning program did not see. For each case, we
used the values of the following six variables as observations,

HarmSocial, HarmHealth, HrmLifOutlk,

HarmLifMrig, HarmWorkSty, HarmFinance,

which were all taken from one marginally independent subset. We then
performed inference in each DMN to predict the value of NmNonDrvrDrink in
the other marginally independent subset. The results are shown in Table 5.



Table 5. Evaluation summary.

Learned net  Infer. time (s)  Hit count  Avg. Euc.  Avg. KL
Non-PI DMN   11.82            315        0.0617     0.01842
PI DMN       16.24            347        0.0193     0.00786
jpd DMN      638.57           347        0.0        0.0



Inference for 1000 cases using the Non-PI DMN took 11.82 sec (see the second
column). The PI DMN took 37% longer (16.24 sec). However, the jpd DMN took
638.57 sec, about 54 times as long as the Non-PI DMN. Hence, the PI DMN and
Non-PI DMN are comparable in terms of inference efficiency.
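
The ratios quoted above follow directly from the second column of Table 5. A small sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Inference times in seconds, copied from Table 5.
infer_time = {"Non-PI DMN": 11.82, "PI DMN": 16.24, "jpd DMN": 638.57}

base = infer_time["Non-PI DMN"]
pi_overhead = infer_time["PI DMN"] / base - 1.0   # extra time relative to Non-PI
jpd_ratio = infer_time["jpd DMN"] / base          # how many times as long as Non-PI

print(f"PI DMN overhead: {pi_overhead:.0%}")
print(f"jpd DMN ratio:   {jpd_ratio:.1f}x")
```

The first figure rounds to 37% and the second to roughly 54, so the PI DMN stays within the same order of magnitude as the Non-PI DMN while the jpd DMN does not.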

Out of 1000 cases, the Non-PI DMN predicted correctly for 315 cases (see
the third column), while the jpd DMN predicted correctly for 347 cases, which
is 10% better. Note that since the target variable has four possible values, a
random guess is expected to hit about 250 cases. Furthermore, since the jpd
DMN captured all the probabilistic dependence in the data, we can do no better
than its performance given only the training data. Interestingly, the PI DMN
predicted just as well as the jpd DMN, while it used only a very small fraction of
the inference time of the jpd DMN. The last two columns of the table show the
Euclidean and K-L (cross entropy) distances between the posterior distribution
by each DMN and that by the jpd DMN, averaged over the 1000 cases. The
distances by the PI DMN are much smaller than those by the Non-PI DMN.
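
For readers who want to reproduce the last two columns on their own data, the two distances can be computed as follows. This is a minimal sketch: the posterior vectors are hypothetical, and the direction of the K-L distance (the jpd DMN posterior as the reference distribution) is an assumption, since the text does not state it.

```python
import math
from typing import List, Sequence


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two distributions over the same values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """K-L cross entropy sum_x p(x) log(p(x)/q(x)).
    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def average(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical posteriors over the four values of the target variable,
# one row per test case: reference (jpd DMN) versus another DMN.
reference = [[0.10, 0.40, 0.30, 0.20], [0.25, 0.25, 0.25, 0.25]]
other = [[0.15, 0.35, 0.30, 0.20], [0.25, 0.25, 0.25, 0.25]]

avg_euc = average([euclidean_distance(p, q) for p, q in zip(reference, other)])
avg_kl = average([kl_distance(p, q) for p, q in zip(reference, other)])
print(avg_euc, avg_kl)
```

Both distances are zero when a DMN reproduces the reference posterior exactly, which is why the jpd DMN row of Table 5 reads 0.0 in both columns.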

7 Conclusion 

Our experimental discovery of the two PI models suggests that PI models,
including recursively embedded PI models, are not simply mathematical constructs
but are a practical reality. In our performance comparison, the learned PI model
reached the same prediction accuracy as the jpd model with only a slight increase
in inference complexity compared with the learned Non-PI model. The PI models
that we presented were discovered after only a few trials from one data set. The
increase in prediction power obtained from the model is far from the potential
increase that can be expected according to the theory of PI models. Guided by the
theory, we believe that PI models with more impressive gains in prediction power can
be found. We plan to demonstrate that with more search in the future. On the
other hand, our performance comparison does show that the concept of PI models
is useful in practice when one seeks to discover models with better overall
performance.

Given the usefulness of learning PI models, a better understanding of the
characteristics of PI models can provide valuable guidance for the design of
algorithms that can learn such models effectively. Our analysis of the RML algorithm is
one more step in that direction. It not only expands the boundary of learnable
PI models with given computational resources, but also reinforces our belief that
with some controlled increase of complexity, PI models can be learned tractably.

Abstract. Domain independent planners can produce better-quality
plans through the use of domain-specific knowledge, typically encoded as 
search control rules. The planning-by-rewriting approach has been pro- 
posed as an alternative technique for improving plan quality. We present 
a system that automatically learns plan rewriting rules and compare it 
with a system that automatically learns search control rules for partial 
order planners. Our results indicate that learning search control rules is 
a better choice than learning rewrite rules. 



1 Introduction 

AI planners must be able to produce high quality plans, and do so efficiently,
if they are to be widely deployed in real-world planning situations. Various
approaches have shown that incorporating domain knowledge into
domain-independent planners can improve both the efficiency of those planners
and the quality of the plans they produce. Traditionally, this knowledge
is encoded as search control rules to limit the search for generation of the
first viable plan. Recently, Ambite and Knoblock have suggested an alternative
approach called planning by rewriting [1]. Under this approach, a partial-order
planner generates an initial plan, and then a set of rewrite rules is used to
transform this plan into a higher-quality plan. Unlike the search control rules
for partial order planners (such as those learned by UCPOP+EBL and PIPP)
that are defined on the space of partial plans, rewrite rules are defined on
the space of complete plans. In addition, it has been argued that plan-rewrite
rules are easier to state than search control rules, because they do not require
any knowledge of the inner workings of the planning algorithm [1]. That may
partially explain why most of the search-control systems have been designed to
automatically acquire search-control rules, whereas existing planning by rewriting
systems use manually generated rewrite rules. To date, there has been no
comparison of these two techniques to study their strengths and weaknesses.
This paper presents an empirical comparison of how the two techniques (search
control rules vs rewrite rules) improve plan quality within a partial-order
planning framework. Our focus, however, assumes that both rewrite rules and
search control rules are to be learned as a function of planning experience.

We designed two systems, Sys-REWRITE and Sys-SEARCH-CONTROL,
that automatically learn to improve the quality of the plans produced by partial
order planners. Both systems have the same overall structure, shown in Figure 1,
and only differ in their implementation of the last step. For Step 1, both
systems use a partial order planning algorithm, POP, of the sort described in [10].
The learning algorithm, ISL (Intra-Solution Learning algorithm), that the two
systems use for Step 2 is similar to that used in [15] and is described in the section
that follows. In Step 3, Sys-REWRITE uses the output of Step 2 to create
plan-rewrite rules, while Sys-SEARCH-CONTROL uses that information to create
search-control rules. The performance component of the two systems necessarily
differs, by definition: Sys-SEARCH-CONTROL uses its rules during its planning
process, whereas Sys-REWRITE uses the rules after it has completed what we
might think of as its draft plan.



Input:  - Problem description in terms of initial state I and goal G
        - A model plan Q for this problem

Output: - A set of rules

1- Use a causal-link partial-order planner to generate a plan P for
   this problem.
2- Identify learning opportunities by comparing the plan episode for
   P with the inferred plan episode for Q.
3- Learn a rule from each learning opportunity and store it.

Fig. 1. High level Algorithm



2 Overview of Approach 

Our approach to plan quality representation and the underlying learning
algorithm may be briefly described as follows^1. We assume that complex quality
tradeoffs among a number of competing factors can be mapped to a quantitative
statement. Methodological work in operations research indicates that a large set
of quality tradeoffs (of the form “prefer to maximize X rather than minimize Y”)
can be encoded into a value function, as long as certain rationality criteria are
met. We also assume that a quality function defined on resources consumed
in a plan exists for a given domain, and we use a modified version of R-STRIPS
to represent resource attributes and the effects of actions on those resources.

Given the knowledge about how to measure the quality of a complete plan,
the learning problem then is how to translate this global quality knowledge into
knowledge that allows the planner to discriminate between different refinement
decisions at a “local” level, i.e., to learn search control knowledge. It is this
general approach that we will contrast with learning rewrite rules.

^1 For details the reader is referred to [15].

As Figure 1 indicates, the training data for both Sys-REWRITE and
Sys-SEARCH-CONTROL consists of the description of a problem in terms of the
initial state and goals, and a completed high-quality plan (a set of totally ordered
steps) that serves as a kind of model. The higher-quality model plan can be
generated by the planner itself through a more exhaustive search of the plan
space or supplied by an external agent (as is done in apprenticeship learning
systems). The learning step (Step 2) is triggered if the model plan is of
higher quality (as per the quality metric) than the system’s default plan
for the same problem. Learning occurs in the context of considering differences
between the higher-quality model plan and the lower-quality default plan produced
by the system. But because the planner is a partial-order planner, what it must
learn is how to make better plan-refinement decisions, i.e., the form of knowledge
to be acquired must affect the partial-order planning process, at least when the
rules to be learned are search control rules. Thus, learning occurs by considering
differences between the planning refinement trace that produced the partial order plan
and elements of an inferred planning refinement trace that is consistent with the
model plan, Q. This is the heart of the ISL algorithm, which is described in more
detail in Section 3.2 and which identifies the knowledge that will be turned into
either search control rules or rewrite rules.



3 System Architecture 

We describe the architecture in terms of the three steps outlined in Figure 1. 



3.1 Step 1: The Planning Component 

The planning element is a causal-link partial-order planner (POP) that, given an 
initial state and some goals, produces a linearized plan that is consistent with 
the partial ordering constraints on steps that it identified during its planning 
process. 



3.2 Step 2: Learning from Plan Refinement Traces 

We will use the transportation problem shown in Figure 3 to illustrate the
workings of the ISL algorithm. The ISL algorithm, shown in Figure 2, looks for
differences between two solutions to a planning problem that differ in overall quality.
It has both the default plan and the default planning trace produced from Step 1,
plus the model plan Q. We do not assume that the planning trace that produced
the higher-quality model plan is available, just the model plan itself. Therefore,
ISL’s first step is to reconstruct the causal-link and ordering constraints that
are consistent with the step sequence that defines the model plan. The model
constraint set inferred by ISL from the model plan presented in Figure 3
is shown in Figure 5.

Input:  - trace for system’s plan Ptr = {d1, ..., dn}
        - the better plan Q

Output: - A set of conflicting choice points C
        - for each conflicting choice point
            - the plan obtained by making the better choice
              at this point and then letting the system refine
              it completely, and the trace of this plan
            - the trace of the better plan

2.1- Analyse Q to determine the set of better-plan-constraints QC
2.2- di <- d1
2.3- While not empty(Ptr) do
       - If the constraint C added by the decision di is in QC then
           - add C to the current partial plan PO
           - i <- i+1
       - else
           2.3.1- mark this decision point as a conflicting choice point
           2.3.2- examine QC to compute the constraint BC that
                  resolves the current flaw
           2.3.3- add BC to PO
           2.3.4- invoke POP to refine PO and produce
                  a plan Pc and its trace Trc.
           2.3.5- Ptr <- Trc
                - i <- i+1

Fig. 2. The Intra-Solution Learning (ISL) Algorithm (Step 2 of Algorithm 1).

The next step is to retrace the default planning-trace (from step 1), looking 
for plan-refinement decisions that added a constraint that is absent in the model 
plan’s constraint set. We call such a decision point a conflicting choice point. 
Each conflicting choice point indicates a possible opportunity to learn a plan- 
refinement decision that contributes to producing a better quality plan. 

Given the default planning trace and the model constraint-set shown in Figure 5, 
ISL retraces the default planning trace (shown in Figure 4) looking for 
a planning decision that adds a constraint not present in the model constraint-set. 
Node 1 in Figure 4 is one such node, where the default planner resolves 
the open-condition flaw at-object(o1,ap2)_end by performing add-action: 
unload-truck(o1,TR,ap2), which adds the causal-link 

    unload-truck(o1,TR,ap2) --[at-object(o1,ap2)]--> end 

to the partial plan. But this causal-link is not in the model constraint-set 
for this problem shown in Figure 5. The model constraint-set contains the 
causal link 

    unload-plane(o1,pl1,ap2) --[at-object(o1,ap2)]--> end 

In other words, the model planner resolved the precondition 
at-object(o1,ap2)_end by add-action: unload-plane(o1,pl1,ap2). Hence, Node 1 
is labeled as a conflicting choice point. Simply put, the two plans differed 
in the way they achieved an open condition. 
Initial-state: {at-object(o1, ap1), at-object(o2, ap2), at-truck(tr1, ap1), 
                at-truck(tr2, ap2), at-plane(pl1, ap1), same-city(ap1, po1), 
                same-city(po1, ap1), same-city(ap2, po2), same-city(po2, ap2), 
                position(ap1, 10), position(po1, 15), position(ap2, 100), 
                position(po2, 110), money(1000), time(0)} 

Goals: {at-object(o1, ap2), at-object(o2, po2)} 

System’s Plan                          Model Plan 
load-truck(o1, tr1, ap1)               load-plane(o1, pl1, ap1) 
drive-truck-acities(tr1, ap1, ap2)     fly-plane(pl1, ap1, ap2) 
unload-truck(o1, tr1, ap2)             unload-plane(o1, pl1, ap2) 
load-truck(o2, tr2, ap2)               load-truck(o2, tr2, ap2) 
drive-truck(tr2, ap2, po2)             drive-truck(tr2, ap2, po2) 
unload-truck(o2, tr2, po2)             unload-truck(o2, tr2, po2) 


Fig. 3. Problem 1: A Transportation planning problem. 



Learning a single search control rule that ensures the application of the model 
planning decision at this point may turn a low-quality plan into a higher-quality 
plan, but it is rather unlikely that this was the only reason for the difference 
in quality between the default plan and the model plan. There may be more 
opportunities to learn what other decisions lead to a better quality plan for 
the same problem. To identify the other planning decisions whose rationale the 
existing planner lacks, ISL adds the constraint added by the model plan at this 
point to the partial plan being refined. Once the higher-quality plan’s planning 
decision has been applied to the partial plan being refined, ISL calls the existing 
planner again to re-plan from that point on (Step 2.3.4 of ISL). A new plan 
and a new trace (which is the same as the initial trace up to the now-replaced 
conflicting choice point, and possibly different thereafter) are returned for this 
same problem, and the process of analyzing this new trace against the constraints 
of the higher-quality model plan is repeated. This analysis may lead to more 
conflicting choice points (as indeed is the case with the example scenario shown 
in Figure 4: at Node 10 the system’s new plan makes a different choice than the 
model plan). Eventually, the default planner will generate a planning trace that 
is consistent with the constraint set inferred for the higher-quality model plan. 
That ends the learning about plan quality that can be accomplished from that 
single training problem. 
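
The retrace-and-replan loop described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: constraints are treated as opaque strings, and `better_choice` and `replan` are hypothetical stand-ins for Steps 2.3.2 and 2.3.4 of ISL. The sketch assumes that `replan` returns a complete trace that begins with the partial plan it was given.

```python
from typing import Callable, List, Set, Tuple

Constraint = str  # a causal link or ordering constraint, kept opaque here


def isl_conflicts(
    trace: List[Constraint],
    model: Set[Constraint],
    better_choice: Callable[[List[Constraint]], Constraint],
    replan: Callable[[List[Constraint]], List[Constraint]],
) -> List[Tuple[int, Constraint, Constraint]]:
    """Return (position, default constraint, model constraint) for each
    conflicting choice point found while retracing `trace`."""
    conflicts = []
    partial: List[Constraint] = []
    i = 0
    while i < len(trace):
        c = trace[i]
        if c in model:
            partial.append(c)            # decision agrees with the model plan
        else:
            bc = better_choice(partial)  # constraint the model plan used instead
            conflicts.append((i, c, bc))
            partial.append(bc)
            trace = replan(partial)      # assumed: new trace starts with `partial`
        i += 1
    return conflicts


# Toy stand-ins: the "planner" always prefers its default choice at each depth.
defaults = ["a", "b1", "c1"]
model_order = ["a", "b2", "c2"]
result = isl_conflicts(
    defaults,
    set(model_order),
    better_choice=lambda partial: model_order[len(partial)],
    replan=lambda partial: partial + defaults[len(partial):],
)
print(result)
```

In the toy run the default trace disagrees with the model at two positions, so two conflicting choice points are reported, mirroring the two conflicts (Node 1 and Node 10) of the running example.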

For any conflicting choice point, there are two different planning decision 
sequences that can be applied to a partial plan: the one added by the existing 
planner, and the other added by the model planner. The application of one set of 
planning decisions leads to a higher quality plan and the other to a lower quality 
plan. It would be possible to construct a rule that indicates that the planning 
decision associated with the better-quality plan should be taken if that same 
flaw is ever encountered again. However, this would ensure a higher-quality plan 
only if that decision’s impact on quality was not contingent on other planning 
decisions that are “downstream” in the refinement process, i.e., further along the 
search path. Thus, some effort must be expended to identify the dependencies 
between a particular planning decision and other planning decisions that follow 
it. 

To identify what downstream planning decisions are relevant to the decision 
at a given conflicting choice point, the following method is used. The open- 



[Figure 4 diagram: partial plans at Node 1 (1st conflicting choice point, 
add-step unload-plane versus add-step unload-truck), Node 2, Node 10 
(2nd conflicting choice point), Node 11(a) on Path A (add-step drive-truck) 
and Node 11(b) on Path B (add-step drive-truck-acities).] 



Fig. 4. Conflicting choice point that leads to Path A (left), from the 
higher-quality plan, and to Path B (right), the lower-quality plan. We use italics to 
represent open preconditions, which are treated as subgoals. When these 
preconditions are still open (i.e., have not been satisfied), they are displayed next 
to the action that requires them. Arrows between actions denote causal-links 
showing which subgoals of an action have been satisfied. The arrow direction is 
from the producer to the consumer of a condition. 
unload-plane(o1, pl1, ap2) --[at-object(o1, ap2)]--> end 
load-plane(o1, pl1, ap1)   --[in(o1, pl1)]--> unload-plane(o1, pl1, ap2) 
fly-plane(pl1, ap1, ap2)   --[at-plane(pl1, ap2)]--> unload-plane(o1, pl1, ap2) 
start                      --[at-object(o1, ap1)]--> load-plane(o1, pl1, ap1) 
start                      --[at-plane(pl1, ap1)]--> fly-plane(pl1, ap1, ap2) 

unload-truck(o2, tr2, po2) --[at-object(o2, po2)]--> end 
load-truck(o2, tr2, ap2)   --[in(o2, tr2)]--> unload-truck(o2, tr2, po2) 
drive-truck(tr2, ap2, po2) --[at-truck(tr2, po2)]--> unload-truck(o2, tr2, po2) 
start                      --[at-object(o2, ap2)]--> load-truck(o2, tr2, ap2) 
start                      --[at-truck(tr2, ap2)]--> load-truck(o2, tr2, ap2) 

Fig. 5. Constraints inferred from the model plan of Problem 1. 



conditions at the conflicting choice point and the two different planning decisions 
(i.e., the ones associated with the high quality model plan and the lower quality 
default plan) are labeled as relevant. The rest of the better-plan’s trace and the 
rest of the worse-plan’s trace are then examined, with the goal of labeling a 
subsequent planning decision q relevant if 

- there exists a causal-link q --> p such that p is a relevant action, or 

- q binds an uninstantiated variable of a relevant open-condition. 

For instance, consider again the first conflicting choice point at Node 1 shown 
in Figure 4. There are two open-condition flaws in the partial plan, but the flaw 
selected to be removed at this point is the open-condition at-object(o1, ap2). 
Clearly, the decision add-action: unload-plane(o1,Pl,ap2) on Path A (left path) 
is relevant. Similarly, the decisions add-action: load-plane(o1,pl1,ap1) and 
add-action: fly-plane(pl1,ap1,ap2) are relevant because they supply preconditions to 
the relevant action unload-plane(o1,Pl,ap2). Further along Path A, the decision 
establish: at-object(o1, ap1) is relevant because it supplies a precondition to 
the relevant action load-plane(o1,pl1,ap1). However, the planning decisions 
add-action: unload-truck(o2, Tr2, po2) and add-action: drive-truck(Tr2, From4, po2) 
are not relevant because the open conditions they resolve are not relevant. The 
labeling process stops on reaching the leaf nodes, and the two relevant planning 
decision sequences (for each conflicting choice point) are output. ISL outputs 
the two planning decision sequences shown in Figure 6 for the first conflicting 
choice point. 
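
The causal-link criterion of the labeling procedure can be sketched as a small fixed-point computation. This is a simplified sketch under stated assumptions: each decision is reduced to a hypothetical (producer, condition, consumer) triple, action names are shortened, and the second criterion (binding an uninstantiated variable of a relevant open-condition) is deliberately left out.

```python
from typing import List, Set, Tuple

Link = Tuple[str, str, str]  # (producer action, condition, consumer action)


def relevant_decisions(trace: List[Link], seeds: Set[str]) -> List[Link]:
    """Label a decision relevant if it supplies a condition to a relevant action."""
    relevant_actions = set(seeds)
    chosen: Set[Link] = set()
    changed = True
    while changed:                       # repeat until no new decision is labeled
        changed = False
        for link in trace:
            producer, _cond, consumer = link
            if consumer in relevant_actions and link not in chosen:
                chosen.add(link)
                relevant_actions.add(producer)
                changed = True
    return [link for link in trace if link in chosen]  # keep trace order


# Simplified Path A decisions from the running example.
path_a = [
    ("load-plane", "in(o1,pl1)", "unload-plane"),
    ("fly-plane", "at-plane(pl1,ap2)", "unload-plane"),
    ("start", "at-object(o1,ap1)", "load-plane"),
    ("drive-truck", "at-truck(tr2,po2)", "unload-truck"),
]
labels = relevant_decisions(path_a, {"unload-plane"})
print(labels)
```

Seeding with the action added at the conflicting choice point labels the plane-related decisions and leaves the truck decision for the second goal unlabeled, as in the text.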
Lower quality sequence: 

add-action: unload-truck(o1,Tr,ap2) to resolve at-object(o1, ap2)_End 
add-action: load-truck(o1,Tr,From2) to resolve in(o1, Tr)_unload-truck 
add-action: drive-truck-acities(Tr,From2,ap2) to resolve 
            at-truck(Tr, ap2)_unload-truck 
establish: at-object(o1, ...) with at-object(O, ...) 
establish: at-truck(Tr,From2)_drive-truck-acities with at-truck(Tr, ...) 
establish: neq(...) with neq(X, ...) 

Better quality sequence: 

add-action: unload-plane(o1,Pl,ap2) to resolve at-object(o1, ap2)_End 
add-action: load-plane(o1,Pl,From1) to resolve in(o1, Pl)_unload-plane 
add-action: fly-plane(Pl,From1,ap2) to resolve 
            at-plane(Pl, From1)_unload-plane 
establish: at-object(o1, From)_load-plane with at-object(O, ...) 
establish: at-plane(Pl,X)_fly-plane with at-plane(Pl, X ...) 
establish: neq(ap1, ap2)_fly-plane with neq(X, ...) 



Fig. 6. Two planning decision sequences identified by ISL for the first conflicting 
choice point shown in Figure 4. The notation Pre_Act indicates that Pre is a 
precondition of action Act, and the notation Eff^Act indicates that Eff is an 
effect supplied by the action Act. 



3.3 Step 3: Learning Search-Control Rules 

Once ISL identifies the relevant refinement decisions associated with the way 
in which a given choice point was resolved differently for the higher-quality 
plan and the worse plan, a search control rule can be created. To do this, 
Sys-SEARCH-CONTROL computes (a) the open-condition flaws present in its 
partial plan that the relevant decision sequence removes, (b) the effects present in 
its partial plan that are required by the relevant decision sequence, and (c) the 
quality value of the new subplan produced by the relevant decision sequence. 
Sys-SEARCH-CONTROL then uses this information to store the rationale (the 
pre-conditions) for applying each refinement decision sequence. For the example 
shown in Figure 3, the rationale learned for the refinement sequence associated 
with the higher-quality plan is:¹ 

open-conditions: { at-object(O, Y)_Act1 } 

effects: { at-object(O, X), at-plane(Pl, X), neq(X, Y) } 

quality: 170 - 3 * distance(Y, X)/200 

trace: add-action: unload-plane(O,Pl,Y) to resolve at-object(O, Y)_Act1 
       add-action: load-plane(O,Pl,X) to resolve in(O, Pl)_unload-plane 
       add-action: fly-plane(Pl,X,Y) to resolve at-plane(Pl, Y)_unload-plane 
       establish: at-object(O, X) with at-object(O, X) 
       establish: at-plane(Pl, X) with at-plane(Pl, X) 
       establish: neq(X, Y) with neq(X, Y) 

¹ In the Prolog tradition, we use capital letters to show variables throughout the paper. 

Rules such as these are then consulted by the default planner in Step 1. 
When refining a partial plan P, Sys-SEARCH-CONTROL’s planner checks to 
see if a rule exists whose preconditions and effects are subsets of P’s 
preconditions and effects respectively. If more than one such rule is available, then 
the rule that has the largest precondition set (i.e., it resolves the largest 
number of preconditions) is selected. If this still leaves more than one rule, then 
Sys-SEARCH-CONTROL’s planner uses the rule whose quality formula has the 
highest value when evaluated in the context of P. A rule is guaranteed to guide 
the planner towards applying refinements that result in a higher-quality plan 
unless the partial plan has some as yet unseen open-conditions that negatively 
interact with the preconditions in the antecedent of the rule. A negative 
interaction occurs if the application of a rule leads to a lower-quality plan. 
Sys-SEARCH-CONTROL detects these cases and learns a more specific rule. 
The reader is directed to previous publications [15] that provide the details of 
this algorithm. 
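
The rule-retrieval policy just described (subset test, then largest precondition set, then highest quality value) can be sketched as below. The `Rule` record, its field names and the example rules are assumptions made for illustration; only the plane rule's quality formula is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    open_conditions: FrozenSet[str]          # flaws the rule resolves
    effects: FrozenSet[str]                  # effects the rule requires
    quality: Callable[[Dict[str, float]], float]


def select_rule(
    rules: Iterable[Rule],
    plan_open: FrozenSet[str],
    plan_effects: FrozenSet[str],
    context: Dict[str, float],
) -> Optional[Rule]:
    """Pick an applicable rule: largest precondition set, then best quality."""
    applicable = [
        r for r in rules
        if r.open_conditions <= plan_open and r.effects <= plan_effects
    ]
    if not applicable:
        return None
    return max(applicable,
               key=lambda r: (len(r.open_conditions), r.quality(context)))


truck = Rule("truck", frozenset({"at-object(O,Y)"}),
             frozenset({"at-truck(T,X)"}), lambda ctx: 100.0)  # made-up quality
plane = Rule("plane", frozenset({"at-object(O,Y)"}),
             frozenset({"at-plane(Pl,X)"}),
             lambda ctx: 170 - 3 * ctx["distance"] / 200)      # formula from the text

chosen = select_rule(
    [truck, plane],
    plan_open=frozenset({"at-object(O,Y)"}),
    plan_effects=frozenset({"at-truck(T,X)", "at-plane(Pl,X)"}),
    context={"distance": 90.0},
)
print(chosen.name if chosen else None)
```

Both rules resolve one open condition here, so the choice falls to the quality formula evaluated in the context of the partial plan.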



3.4 Step 3: Learning Rewrite Rules 

As noted earlier, Sys-SEARCH-CONTROL and Sys-REWRITE differ in the use 
they make of the output of Step 2. Rather than making a search-control rule 
that will be applied during the partial-order planning process, Sys-REWRITE 
computes (a) the actions that are added by the worse plan’s relevant decision 
sequence, which become the to-be-replaced action sequence, (b) the actions that 
are added by the better plan’s relevant decision sequence, which become the 
replacing action sequence, and (c) the preconditions and effects of the replacing 
and the to-be-replaced action sequences. Sys-REWRITE then stores this 
information as a rewrite rule. For instance, the rule learned for the example shown 
in Figure 3 is: 

replace: 
    actions: {load-tr(O,T,X), drive-tr-acities(T,X,Y), unload-tr(O,T,Y)} 
    causal-links: { 
        load-tr(O,T,X) --[in-tr(O,T)]--> unload-tr(O,T,Y), 
        drive-tr-acities(T,X,Y) --[at-tr(T,Y)]--> unload-tr(O,T,Y) } 

with: 
    actions: {load-pl(O,L,X), fly-plane(L,X,Y), unload-pl(O,L,Y)} 

Sys-REWRITE uses a POP algorithm to generate an initial plan P_i, the set 
of causal links Cl_P, ordering constraints O_P and the set of effects E_P. It then 
checks to see if a rule exists whose to-be-replaced sequence S1 is a subset of P_i 
and whose causal-link constraints Cl_S1 are a subset of P_i’s causal-links set. If 
any such rule R = (S1, Pre_S1, Eff_S1, Cl_S1, S2, Pre_S2, Eff_S2) is retrieved, then 
all ordering constraints from O_P that involve an action from S1 are deleted. It 
also deletes all causal links from Cl_P whose producer is a member of S1. All 
those conditions in the causal links that have a producer in S1 and a consumer 
in P_i - S1 are added to the set of open conditions. The replacing action sequence 
S2 is appended to the set of actions. The new partial plan is then refined. If all its 
flaws can be removed without adding any actions and the resulting plan P_N has 
a higher quality value than P_i, then it is returned; otherwise P_i is returned. 
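
The rule-application step above can be sketched as follows. The data layout (a dictionary with hypothetical keys `replace`, `links` and `with`) is an assumption for illustration. The sketch also assumes that the replaced actions are dropped from the action set and that causal links consumed by a replaced action are dropped as well, which the text does not state explicitly. The subsequent refinement and quality comparison are not modeled.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, str, str]      # (producer, condition, consumer)
Ordering = Tuple[str, str]       # (before, after)


def apply_rewrite(
    actions: List[str],
    links: Set[Link],
    orderings: Set[Ordering],
    rule: Dict[str, list],
) -> Optional[Tuple[List[str], Set[Link], Set[Ordering], Set[Tuple[str, str]]]]:
    s1 = set(rule["replace"])
    if not s1 <= set(actions) or not set(rule["links"]) <= links:
        return None                                    # rule does not match
    kept_orderings = {(a, b) for a, b in orderings
                      if a not in s1 and b not in s1}
    # Assumption: links into a replaced action are removed too.
    kept_links = {l for l in links if l[0] not in s1 and l[2] not in s1}
    # Conditions produced inside S1 for consumers outside S1 become open again.
    open_conditions = {(cond, cons) for prod, cond, cons in links
                       if prod in s1 and cons not in s1}
    new_actions = [a for a in actions if a not in s1] + list(rule["with"])
    return new_actions, kept_links, kept_orderings, open_conditions


plan_actions = ["load-truck", "drive-truck-acities", "unload-truck"]
plan_links = {
    ("load-truck", "in(o1,tr1)", "unload-truck"),
    ("drive-truck-acities", "at-truck(tr1,ap2)", "unload-truck"),
    ("unload-truck", "at-object(o1,ap2)", "end"),
}
plan_orderings = {("load-truck", "drive-truck-acities"),
                  ("drive-truck-acities", "unload-truck")}
rule = {
    "replace": ["load-truck", "drive-truck-acities", "unload-truck"],
    "links": [("load-truck", "in(o1,tr1)", "unload-truck"),
              ("drive-truck-acities", "at-truck(tr1,ap2)", "unload-truck")],
    "with": ["load-plane", "fly-plane", "unload-plane"],
}
rewritten = apply_rewrite(plan_actions, plan_links, plan_orderings, rule)
print(rewritten)
```

After the rewrite, the goal condition at-object(o1,ap2) is open again and must be re-established by the replacing actions during refinement.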

As Ambite and Knoblock [1] point out, the performance of a rule-rewriting 
system depends on a number of factors, including (1) the algorithm used to produce 
the initial plan, and (2) the search algorithm used for plan-rewriting. Two search 
strategies are (a) first improvement, which generates the neighborhood 
incrementally and selects the first solution of better quality than the current one, 
and (b) best improvement, which generates the complete neighborhood and selects 
the best solution within this neighborhood. 
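
The difference between the two search strategies can be sketched generically. The functions below are an illustrative sketch, not the rewrite module of Sys-REWRITE: `neighbors` is a hypothetical function that lazily yields rewritten plans, and `quality` is a hypothetical scoring function where higher is better.

```python
from typing import Callable, Iterable, TypeVar

P = TypeVar("P")


def first_improvement(current: P,
                      neighbors: Callable[[P], Iterable[P]],
                      quality: Callable[[P], float]) -> P:
    """Stop at the first neighbor that beats the current plan."""
    for candidate in neighbors(current):   # neighborhood generated incrementally
        if quality(candidate) > quality(current):
            return candidate
    return current


def best_improvement(current: P,
                     neighbors: Callable[[P], Iterable[P]],
                     quality: Callable[[P], float]) -> P:
    """Examine the complete neighborhood and keep its best member."""
    best = max(neighbors(current), key=quality, default=current)
    return best if quality(best) > quality(current) else current


# Toy example: plans are integers and quality is the integer itself.
toy_neighbors = lambda p: (p + d for d in (1, 5, 3))
first = first_improvement(0, toy_neighbors, lambda p: p)
best = best_improvement(0, toy_neighbors, lambda p: p)
print(first, best)
```

First improvement touches only as many neighbors as it needs, while best improvement always pays for the full neighborhood; this is the cost difference that shows up in the node counts reported in the experiments.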

To provide a fair comparison of the two approaches, we used the 
derivational-analogy algorithm of [4] to speed up the generation of the initial plans for 
Sys-REWRITE. Therefore, Sys-REWRITE not only learns rewrite rules in Step 3, 
but also caches entire planning episodes. 



4 Experiments and Results 

Three domains were used for the experiments reported here: the Softbot domain, 
a modified transportation logistics domain [?], and Minton’s process planning 
domain [?]. For the Softbot domain, plan quality is a function of the sum of all the 
resources consumed [17]. For the logistics domain, the quality function depends 
on the resources of time and money and is described in [?]. Each action had a 
fixed cost associated with it in the process planning domain. 

Dependent measures were planning effort, as a function of the number of 
partial plans searched, and plan quality. One hundred and twenty 2-goal 
problems were randomly generated for each of the logistics and Softbot domains. For 
the process planning problems, the number of goals for each problem was chosen 
randomly between 2 and 5. The process planning domain had two objects and the goal 
was to shape them. For the logistics domain, each problem had two objects to 
deliver, three cities, three trucks and two planes. Softbot problems contained 
two persons about whom some information was sought. 

Training sets of 20, 30, 40, and 60 problems were randomly selected from the 
120-problem corpus, and for each training set, the remaining problems served as 
the corresponding testing set. To identify the high quality model plan for each 
training problem, POP was run in a depth-first search mode with a depth limit 
of 15. The first 20 plans (or all possible solutions for a problem if this number 
was less than 20) were generated and the highest quality plan from these was 
used as a model plan for that problem. These were also the plans from which 
the distance was measured to compute the plan quality metric. Planning effort 
was measured by the number of new nodes expanded by each planner, and plan 
quality was measured by computing the average difference between the quality 
value of the optimal quality plan and the quality of the plan produced by the 
planner on the test problems. The rewrite module of Sys-REWRITE-first uses the 
first-improvement search strategy and the rewrite module of Sys-REWRITE-best 
uses the best-improvement search strategy. 



Number of new nodes expanded 

number of training examples    0        20         30         40         60 
Sys-REWRITE-first              24+0     9.8+6.7    8.3+6      7+7        7.8+5 
Sys-REWRITE-best               24+0     9.8+36     8.9+44     8.5+75     7.9+95 
Sys-SEARCH-CONTROL             24       18.3       17.45      17.3       16.8 

Average difference from optimal quality plans 

number of training examples    0        20         30         40         60 
Sys-REWRITE-first              1        0.82       0.84       0.78       0.80 
Sys-REWRITE-best               1        0.01       0          0          0 
Sys-SEARCH-CONTROL             1        0.05       0.04       0.03       0 



Table 1. Performance data for the process planning domain. 



Tables 1, 2 and 3 show the performance of Sys-REWRITE and 
Sys-SEARCH-CONTROL on the process-planning, transportation and Softbot domains, 
respectively. The new nodes expanded by Sys-REWRITE are shown as N + M, where 
N is the number of nodes expanded by the default planner and M is the number 
of nodes expanded by the rewrite module (i.e., the number of nodes required to 
refine the flaws introduced by applying rewrite rules to the initial plan). The 
two counts are reported separately because the rewrite nodes are slightly less 
costly than the planning nodes. 



Number of new nodes expanded 

number of training examples    0        20         30         40         60 
Sys-REWRITE-first              36       14+132     14+156     13+124     12+127 
Sys-REWRITE-best               36       14+14212   14+21518   13+22020   12+22788 
Sys-SEARCH-CONTROL             36       12.5       13         12         11 

Average difference from optimal quality plans 

number of training examples    0        20         30         40         60 
Sys-REWRITE-first              1        0.95       0.96       0.94       0.92 
Sys-REWRITE-best               1        0.85       0.74       0.72       0.70 
Sys-SEARCH-CONTROL             1        0.03       0.02       0.01       0 



Table 2. Performance data for the transportation domain. 



For all three domains, both the rewrite and the search-control rules lead to 
significant improvements in plan quality. As expected, the quality of the plans 
produced by Sys-REWRITE-best is higher than those produced by Sys-REWRITE-
Number of new nodes expanded 

number of training examples    0        20         30         40         60 
Sys-REWRITE-first              10.4+0   3.4+21     3.0+22     2.5+25     2.1+24 
Sys-REWRITE-best               10.4+0   3.4+86     3.0+96     2.5+108    2.1+126 
Sys-SEARCH-CONTROL             10.4     3.03       3.0        2.44       2.1 

Average difference from optimal quality plans 

number of training examples    0        20         30         40         60 
Sys-REWRITE-first              1        0.67       0.65       0.59       0.60 
Sys-REWRITE-best               1        0.22       0.18       0.14       0.13 
Sys-SEARCH-CONTROL             1        0.55       0.47       0.14       0.12 



Table 3. Performance data for the softbot domain. 



first. It is interesting to note, however, that for all three domains the quality 
improvements obtained by using search-control rules are comparable to or better than 
those obtained by rewrite rules. Recall that Sys-REWRITE-best does an 
exhaustive consideration of the complete neighborhood, and that is reflected in the 
2- to 2000-fold increase in node expansion over Sys-SEARCH-CONTROL. For 
Softbot, Sys-REWRITE-best expands about 30 times more nodes than 
Sys-SEARCH-CONTROL. For the transportation domain, which is the most 
complex one, the planning-by-rewriting system needs to search thousands of extra nodes 
without finding better quality plans than those produced by the search control 
system. For all that work, no improvement in quality! 
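
The node-expansion ratios quoted above can be recomputed from the tables. The short sketch below does this for the 20- and 60-example columns, taking N + M for Sys-REWRITE-best and dividing by the Sys-SEARCH-CONTROL count; the dictionary layout and function name are illustrative choices, while the numbers are copied from Tables 1 to 3.

```python
# ((N, M) for Sys-REWRITE-best, nodes for Sys-SEARCH-CONTROL), from Tables 1-3
data = {
    ("process planning", 20): ((9.8, 36), 18.3),
    ("process planning", 60): ((7.9, 95), 16.8),
    ("transportation", 20): ((14, 14212), 12.5),
    ("transportation", 60): ((12, 22788), 11),
    ("softbot", 20): ((3.4, 86), 3.03),
    ("softbot", 60): ((2.1, 126), 2.1),
}


def expansion_ratio(rewrite, search_control):
    """Total rewrite-best nodes (planning + rewriting) per search-control node."""
    n, m = rewrite
    return (n + m) / search_control


ratios = {key: expansion_ratio(*value) for key, value in data.items()}
for (domain, size), ratio in ratios.items():
    print(f"{domain:17s} {size:2d} examples: {ratio:8.1f}x")
```

The smallest ratio comes from process planning with 20 training examples (about 2.5) and the largest from transportation with 60 (a little over 2000), which is the range the text refers to; the Softbot figure of about 30 corresponds to the 20-example column.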



5 Related Work 



The basic idea of learning search-control rules to speed up problem solving can 
be traced back to the early work on EBL [?]. Minton’s [?] PRODIGY/EBL 
learned control rules by explaining why a search node leads to success or failure. 
Kambhampati et al. [?] propose a technique based on EBL to learn control rules 
for partial-order planners and apply it to SNLP and UCPOP to learn rejection 
rules. Ihrig et al. [1] extended SNLP+EBL to learn from planning successes as 
well as failures. However, these systems only aim to improve planning efficiency 
and not plan quality. There has been some work on the PRODIGY project for 
learning control rules to improve plan quality [?]. However, such work has 
been limited to state-space planners. 

Zimmerman et al. [?] and Estlin and Mooney [2] present two inductive learning 
techniques to learn search control rules for partial order planners. SCOPE 
[2] uses inductive logic programming techniques whereas Zimmerman’s system 
uses a neural network to acquire search control rules for UCPOP. 

Ambite and Knoblock [?] coined the term planning by rewriting. Their system, 
PBR, used a small number of hand-coded rewrite rules for the Blocks 
world, the process planning domain and the query planning domain to improve 
the quality of the plan produced by SAGE [?], a partial-order planner. 
6 Conclusion 

Much work has been done to improve planning efficiency. The work reported 
here is concerned specifically with methods to improve the quality of plans. 
In previous work we demonstrated a learning algorithm that can 
identify search-control rules that guide plan-refinement decisions towards 
producing higher-quality plans. Ambite and Knoblock [1] argue that it may 
be more practical to generate a low-quality plan efficiently, and then “fix” the 
quality of the plan with some after-the-fact, hand-crafted rewrite rules. Thus, 
it was natural for us to ask two questions: (a) can such hand-crafted rules for 
improving plan quality be learned, and (b) if so, how do such learned rules stack 
up against search control rules in producing high-quality plans? In this paper, 
we presented a method for learning rewrite rules based on the same framework 
used for learning search control rules. Our data indicate that higher quality plans 
are produced by using the search control rules than by using the 
rewrite rules, and at a considerable efficiency savings. 
Abstract. One of the key issues in automated theorem proving is the 
search for optimal proof strategies. Since there is not one uniform 
strategy which works optimally on all proof tasks, one is faced with the 
difficult problem of selecting a good strategy for a given task. Strategy 
parallelism is a way of circumventing this strategy selection problem. 
However, the problem of selecting the parallel strategies and distributing 
the available resources remains. Therefore we have developed a method 
for strategy evaluation and selection based on training data. We present 
a theorem prover system which has been automated with respect to 
the entire process of theorem prover application including automatic 
data generation, automatic schedule selection and classical automated 
theorem proving. In the theorem prover e-SETHEO, we present an 
implementation of such a system that, for the first time, can handle the 
necessary problem domain adaption fully automatically and which is an 
improvement of the prover which solved the largest number of problems 
in the MIX division of the CADE-16 ATP competition. This is followed 
by some experimental data produced with this system. We address the 
problem of test set extraction and give an assessment of our work as well 
as an outlook on future research issues. 



1 Introduction 

Automated Theorem Proving (ATP) is the subfield of computer science dealing 
with the automatic verification of the validity of logical formulae. Attempting 
to prove the validity of such formulae automatically, particularly beyond simple 
textbook examples, typically results in a tremendously large search space. Such 
a search problem is usually solved by a uniform search procedure. In automated 
deduction, different search strategies may behave significantly differently on a 
given problem. Unfortunately, in general, it cannot be decided in advance which 

* This work was supported by the Deutsche Forschungsgemeinschaft (DFG) as part of 
the Sonderforschungsbereich 342. 
strategy is the best for a given problem. This motivates the competitive use 
of different strategies, especially when the available resources are restricted. In 
order to be successful with such an approach, the strategies must satisfy the 
following two conditions. Sub-linearity: Let sol(s,t) denote the set of problems 
solved with a strategy s in time t. Then, for a typical set of problems, the 
function $|sol(s,t)|$ of t must be sub-linear, i.e., with each additional time 
interval fewer new problems are solved. Complementarity: The competing 
strategies must be complementary w.r.t. a given problem set, i.e., the sets of 
problems solved in a certain time limit by two different strategies should differ 
significantly. 

If both conditions are satisfied, then a competitive use of different strategies 
can be more successful than the best single strategy. 
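
The effect of the two conditions can be illustrated with a small sketch. The solving times below are invented for illustration, and splitting the time budget equally between strategies is an assumption of this sketch rather than the allocation method of the paper; the function `solved` plays the role of sol(s,t).

```python
from typing import Dict, List, Set, Tuple


def solved(times: Dict[str, float], limit: float) -> Set[str]:
    """sol(s, t): problems a strategy solves within the time limit."""
    return {problem for problem, t in times.items() if t <= limit}


def compare(strategies: List[Dict[str, float]],
            total_time: float) -> Tuple[int, int]:
    """Best single strategy with the full budget vs. an equal-share competition."""
    best_single = max(len(solved(s, total_time)) for s in strategies)
    share = total_time / len(strategies)
    competitive = set().union(*(solved(s, share) for s in strategies))
    return best_single, len(competitive)


# Hypothetical solving times in seconds; a missing problem is never solved.
s1 = {"p1": 2, "p2": 3, "p3": 40}
s2 = {"p1": 50, "p3": 4, "p4": 5}
outcome = compare([s1, s2], total_time=20)
print(outcome)
```

Each strategy finds its solutions early (sub-linearity) and the two solve different problems (complementarity), so halving each strategy's time loses nothing while the union of solved problems doubles. Running the same strategy twice shows no gain, since there is no complementarity to exploit.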

This paper is organized as follows. In Section 2 we give a brief overview 
of automated theorem proving and relate strategy parallelism to other 
parallelization methods in automated deduction. We briefly address problems like 
partitioning and strategy allocation. Furthermore, in this section we give an 
outline of the design and the implementation of our strategy parallel theorem 
prover e-SETHEO. Then, in Section 3 we describe our data generation, strategy 
evaluation and selection principles. Section 4 presents some experimental results 
obtained with this system. The extraction of test sets and the effects of test set 
reduction on the prover performance are discussed in Section 5. This is followed 
by an outlook on future development and an assessment of our current work in 
Section 6 and, finally, a conclusion in Section 7. 

2 Strategies and Strategy Parallelism; a Framework for a 
Strategy Parallel Prover 



Many ways of organizing parallel computing have already been studied. However, 
many of these methods do not apply to automated theorem proving, since it is 
generally impossible to predict the size of each of the parallelized subproblems 
and it is therefore very hard to create an even workload distribution among the 
different agents. Here, we cite some of the successful examples. 

A central concept in our work is the (search) strategy. For us, a strategy is 
one particular way of traversing the search space. From our practical point of 
view, a strategy is a calculus for automated theorem proving combined with a 
particular search method. 

We are now looking for a way of efficiently combining and applying different 
strategies in parallel. An example can be found in the nagging concept [SS94a]: 
dependent subtasks are sent by a master process to the naggers, which try to 
solve them and report on their success. The results are integrated into the main 
proof attempt. A combination of different strategies is used within the 
teamwork concept of DISCOUNT [DKS97] for unit equality problems. These 
strategies periodically exchange intermediate results and work together evaluating 
these intermediate results and determining the further search strategies. 
Strategy selection techniques are applied even in systems with finite search spaces like 
EUREKA [CV98]. Partitioning of the search space is done, e.g., in PARTHEO [?]. 

Some of these approaches are very good in certain aspects. Partitioning, for 
example, can guarantee that no part of the search space is considered twice, 
therefore providing an optimal solution for the problem of generating “signif- 
icantly” differing search strategies. The fundamental weakness of partitioning, 
however, is that to provide completeness it requires the reliability of all dis- 
tributed agents. Therefore, we have investigated a competition approach. Differ- 
ent strategies are applied to the same problem and the first successful strategy 
stops all others. However, not all strategies are equally promising or require equal 
effort. It is therefore advisable to divide the available resources in an adequate 
way. 

The selection of more than one search strategy in combination with 
techniques to partition the available resources such as time and processors is called 
strategy parallelism [WL99]. Different, competitive agents traverse the same 
search space via different, ideally non-overlapping paths. Such a selection of 
strategies together with a resource allocation for the strategies is called a 
schedule. It is intended that these strategies should traverse the search space such 
that, in practice, the repeated consideration of identical parts is largely avoided. 
One of the key problems of strategy parallelism is the optimal distribution of 
resources within a schedule. Our approach to this problem will be discussed in 
Section 3. 

Using SETHEO as the basic underlying inference machine, we have developed 
p-SETHEO, a prototypical implementation of a strategy parallel theorem prover. 
This system has meanwhile been further developed into e-SETHEO, the 
most important improvements being the augmentation by the new E prover, a 
superposition calculus equality prover [Sch99], and the employment of FLOTTER 
[WGR96] as a conversion procedure from full first order logic to conjunctive 
normal form (CNF). We have used e-SETHEO to collect experimental data and 
successfully participated with that system in the ATP system competition at 
the CADE-16 conference. 

Example 1 (ATP search strategy). To give a better idea of what the term strat- 
egy means in the context of e-SETHEO, we will give an example. One of the 
strategies employed by our system is the following: 

stexposu -rel -replnum 

inwasm -foldup 

sam -dynsgreord 2 -wdr 

This strategy implements a particular application of the model elimination 
calculus. First the formula preprocessor stexposu performs a relevance 
transformation on the input, then the formula is compiled by the formula compiler 
inwasm into abstract machine code including special code enabling the fold up 
refinement, and finally, the model elimination proof is started by the virtual 
machine of SETHEO, sam, using a weighted depth bound and dynamic 
subgoal reordering, thus defining the search behaviour. These techniques and other 
refinements of model elimination are discussed in [Let98]. 

Fig. 1. Schematic view of the functionality of e-SETHEO. 

While in the early stages of development nearly all strategies were variants
of SETHEO (that is, e-SETHEO would invoke different instantiations of SETHEO
with different parameterizations), we have since begun to incorporate a much
wider variety of strategies covering special problem classes. e-SETHEO
currently employs 112 such strategies, about 80 of which are variants of model
elimination. Other types of strategies use the superposition calculus or
propositional decision procedures. Often, certain characteristics of a proof
task that are easy to recognize call for a special treatment of this task with
a special strategy. For example, if the problem is either ground or can be
grounded, it is possible to apply a propositional theorem prover to that
problem, which is not only very fast but also implements a decision procedure
for this kind of problem, allowing fast detection of non-theorems as well.
e-SETHEO is able to incorporate any state of the art ATP system, provided it
adheres to minimum standards regarding issues such as resource allocation or
input-output behavior.

During a prover run, e-SETHEO performs its proof tasks in a number of
distinguishable steps, as depicted in Figure 1. The first step is problem
analysis and CNF conversion. This is followed by the strategy allocation,
where a schedule is defined for the selected strategies. Finally, the actual
parallel prover runs are started.

Fig. 2. Schematic view of the e-SETHEO autoconfiguration.



3 Data Generation and Strategy Evaluation and Selection 

Automated theorem provers are not yet powerful enough to successfully tackle
new problem domains without prior adaptation. As a matter of fact, all state
of the art systems taking part in the CADE ATP systems competition had been
tuned to some degree to the TPTP problem set. This tuning, however, invariably
involved a lot of manual work and manpower. In contrast, our prover system is
fully automated with respect to three steps: the automatic generation of the
data necessary for the strategy evaluation, the automatic selection of
appropriate strategies together with the assignment of resources to the
selected strategies, and the employment of automated (in the canonical
meaning) theorem provers. These three steps each consist of several phases. A
schematic overview of the functionality implemented in all steps is given in
Figure 2.

We begin with the first step, the automated data generation. We approach the
problem of determining an optimal schedule by using a set of training examples
from the given domain and optimizing the admissible strategies for this
training set. Given a set of training problems, a set of usable strategies, a
time limit t and a number of processors, we want to determine an optimal
distribution of resources to each strategy, i.e., a combination of strategies
which solves a maximal number of problems from the training set within the
given resources. Unfortunately, even the single processor decision variant of
this problem is strongly NP-complete [WL99]. In practice, we therefore use
suboptimal schedules, which we obtain with the aid of a genetic algorithm
[Hol92, SW99] followed by an application of a discrete gradient procedure to
the best individuals. Providing the necessary training data, however, is a
very extensive task. To compute a schedule, we have to determine for all
admissible strategies S the solution times (within t) on all problems P from
the training set. In our experiment, we employed 50 workstations for several
weeks to determine all necessary data on a training set of about 4000 problems
and a set of 112 strategies. The workstations were organized in a loosely
connected cluster with shared disk memory. Additionally, we used a controller
workstation separated from the cluster.
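The role of the solution-time data can be illustrated with a minimal Python
sketch. It assumes a single processor, so that a schedule is admissible when
the allotted times sum to at most the time limit; the names
solved_by_schedule, is_admissible, EXAMPLE_TIMES and EXAMPLE_SCHEDULE, the
strategy labels, and all numbers are hypothetical and are not taken from
e-SETHEO.

```python
from typing import Dict, Optional, Set

# times[strategy][problem] is the measured solution time in seconds,
# or None if the strategy did not solve the problem within the limit t.
Times = Dict[str, Dict[str, Optional[float]]]


def solved_by_schedule(schedule: Dict[str, float], times: Times) -> Set[str]:
    """Problems solved when every strategy runs for its allotted time."""
    solved = set()
    for strategy, budget in schedule.items():
        for problem, seconds in times.get(strategy, {}).items():
            if seconds is not None and seconds <= budget:
                solved.add(problem)
    return solved


def is_admissible(schedule: Dict[str, float], time_limit: float) -> bool:
    """Single-processor assumption: the allotted times must fit the limit."""
    return sum(schedule.values()) <= time_limit


# Hypothetical measurements for two strategies on four training problems.
EXAMPLE_TIMES: Times = {
    "model_elimination": {"P1": 12.0, "P2": 250.0, "P3": None, "P4": None},
    "superposition": {"P1": 40.0, "P2": None, "P3": 90.0, "P4": None},
}
EXAMPLE_SCHEDULE = {"model_elimination": 100.0, "superposition": 200.0}
```

Evaluating a candidate schedule then only requires table lookups, which is why
the expensive part of the procedure is filling the table and not scoring
schedules.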

Due to the needs of a shared and distributed computing environment, the data
generation system needs some means of balancing and limiting the load produced
by our experimental setup, so that other users are not needlessly encumbered.
Therefore we employ a Performance Evaluator, which is responsible for the
generation and maintenance of a database containing all data necessary for the
evaluation of the expected performance and the expected free resources on all
involved machines. We limit the number of prover processes running
simultaneously on the same processor as well as the maximum load and the
minimum amount of free memory allowed before starting a new prover task. The
Task Generator maintains a list of tasks to be treated. Using inquiries on the
available strategies, on the available problems and on the strategy-problem
pairs which have already been finished properly, the Task Generator generates
a to-do list which is given to the Task Scheduler component. The Task
Scheduler launches the tasks from the to-do list as soon as the Performance
Evaluator provides usable hosts.
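The interplay of Task Generator and Performance Evaluator can be sketched as
follows. This is only an illustration under assumptions: the function and
class names (todo_list, HostStatus, host_is_usable) and the default threshold
values are hypothetical, since the actual limits used in the experiments are
not stated here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, List, Set, Tuple

Task = Tuple[str, str]  # (strategy, problem)


def todo_list(strategies: Iterable[str], problems: Iterable[str],
              finished: Set[Task]) -> List[Task]:
    """All strategy-problem pairs that have not yet been finished properly."""
    return [task for task in product(strategies, problems)
            if task not in finished]


@dataclass
class HostStatus:
    running_provers: int
    load: float
    free_memory_mb: float


def host_is_usable(status: HostStatus, max_provers: int = 1,
                   max_load: float = 1.0, min_free_mb: float = 128.0) -> bool:
    """Admission check before a new prover task is started on a host.

    The three limits mirror the criteria named in the text; the default
    values are placeholders chosen for this sketch.
    """
    return (status.running_provers < max_provers
            and status.load <= max_load
            and status.free_memory_mb >= min_free_mb)
```

Because the to-do list is recomputed from the set of finished pairs, the same
function also covers a restart after a failure: pairs that were recorded as
finished are simply not generated again.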

If a task finishes correctly, i.e., without errors caused by the operating
system or other users, this fact and the generated data are recorded and used
for the generation of the data matrices required for the genetic gradient
algorithm as well as for providing the re-entry points necessary when the
whole data generation system has to be restarted, e.g., after re-booting the
controller workstation or after a general network failure. An abstract view of
the data generation system can be seen in Figure 3.

The second step is the automated strategy evaluation and schedule
determination. The multitude of settings of the basic SETHEO inference machine
and the considerable number of additional prover tools employed by e-SETHEO
result in a vast number of different configurations in which the prover system
can be used. It is obviously not feasible to test all these possible
configurations for their performance on a given problem domain. However, using
heuristics, intuition and experience, about one hundred of these
configurations have been identified as potentially useful and implemented as
strategies. Still, with a hundred different strategies to choose from (and to
distribute resources among), trying to obtain an optimal solution for resource
allocation would be futile. Therefore, we choose our schedules from a number
of pseudo-optimal solutions which we acquire in a three-phased process.

Fig. 3. Schematic view of the data generation system.

In a first phase, all given problems are divided into a small number of classes 
according to some very simple discrimination criteria. At the moment, we use 
nine syntactic classes, which are: 

— the class of all groundable problems, 

— the unit equality problems, 

— the pure equality problems, 

— the Horn problems without equality literals, 

— the Horn problems with equality literals, 

— the non-Horn problems with equality literals presented in a short formula, 

— the non-Horn problems with equality literals presented in a long formula, 

— the non-Horn problems without equality literals presented in a short formula, 

— the non-Horn problems without equality literals presented in a long formula. 
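The classification into the nine syntactic classes listed above can be
sketched as a simple decision cascade. Several points are assumptions made for
this illustration only: the order in which the criteria are tested (taken from
the order of the list), the boundary between "short" and "long" formulas
(LONG_THRESHOLD is a placeholder), and the feature record ProblemFeatures,
which is presumed to be filled by a separate analysis of the clause set.

```python
from dataclasses import dataclass


@dataclass
class ProblemFeatures:
    groundable: bool     # ground, or can be grounded
    all_unit: bool       # every clause consists of a single literal
    only_equality: bool  # every literal is an equality literal
    has_equality: bool   # at least one equality literal occurs
    horn: bool           # at most one positive literal per clause
    size: int            # formula size, e.g. number of clauses


LONG_THRESHOLD = 100  # placeholder; the real short/long boundary is not given


def syntactic_class(f: ProblemFeatures) -> str:
    """Map a problem to one of the nine syntactic classes."""
    if f.groundable:
        return "groundable"
    if f.only_equality and f.all_unit:
        return "unit equality"
    if f.only_equality:
        return "pure equality"
    if f.horn:
        return "Horn with equality" if f.has_equality else "Horn without equality"
    length = "long" if f.size > LONG_THRESHOLD else "short"
    equality = "with" if f.has_equality else "without"
    return f"non-Horn {equality} equality ({length})"
```

All tests are purely syntactic and cheap, so the class of a new problem can be
determined at run-time before any schedule is chosen.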

In a second phase, for each of these classes a set of schedules is evaluated
by the genetic algorithm described in [SW99]. The best of these schedules are
selected for refinement in the third phase by applying the gradient method
explained in [Wol98b]. This process results in a set of pseudo-optimal
schedules, one for each syntactic class, that are then used for configuring
e-SETHEO.
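The refinement idea of the third phase can be illustrated by a greedy local
search that shifts time between strategies. This is not the gradient method of
[Wol98b], whose details are not given here; it is a simplified stand-in under
the assumptions of a single processor, a fixed step size, and a schedule
represented as a mapping from strategy names to seconds. The names
count_solved and refine are hypothetical.

```python
from typing import Dict, Optional, Tuple

Times = Dict[str, Dict[str, Optional[float]]]


def count_solved(schedule: Dict[str, float], times: Times) -> int:
    """Number of problems solved by at least one strategy within its budget."""
    return len({problem
                for strategy, budget in schedule.items()
                for problem, seconds in times[strategy].items()
                if seconds is not None and seconds <= budget})


def refine(schedule: Dict[str, float], times: Times,
           step: float = 10.0) -> Tuple[Dict[str, float], int]:
    """Move `step` seconds from one strategy to another while this helps."""
    current = dict(schedule)
    best = count_solved(current, times)
    improved = True
    while improved:
        improved = False
        for src in current:
            for dst in current:
                if src == dst or current[src] < step:
                    continue
                candidate = dict(current)
                candidate[src] -= step
                candidate[dst] += step
                score = count_solved(candidate, times)
                if score > best:  # accept only strict improvements
                    current, best, improved = candidate, score, True
                    break
            if improved:
                break
    return current, best
```

The total time of the schedule is preserved by every move, and the loop
terminates because the number of solved problems strictly increases with each
accepted move. A search of this kind cannot cross plateaus, which is one
reason to start it from good individuals produced by the genetic algorithm.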

The third and last step is the automated proof search in the literal sense,
where an actual problem is treated at run-time using the pre-computed schedule
of the syntactic class the problem belongs to.

4 Experimental Results

Our experiments were conducted in two phases. First we intended to verify the
feasibility of the approach described in Section 3 with reduced problem and
strategy sets. Therefore we used the 547 eligible TPTP problems [SSY94] of the
theorem prover competition at the 15th Conference on Automated Deduction in
1998 as our training data set. Our participating prover p-SETHEO [Wol98a]
employed 91 different strategies at that time; these formed our strategy set.
We extracted all these strategies and ran each strategy on all problems using
the standard sequential SETHEO [MTL+97]. The successful results of all those
runs were collected in a single list that became the database for our genetic
gradient algorithm. 398 problems can be solved by at least one of the
strategies in at most 300 seconds. Then we ran the genetic algorithm followed
by the gradient procedure on the collected data. The success of each schedule,
i.e., of each individual of our genetic algorithm, was evaluated by looking up
the list entries for the problems and the respective strategies and time
resources.

In our experiments we used the combination of the genetic algorithm and the
gradient procedure, as described above. The attributes of the initial
generation, which are selected at random, strongly influence the overall
results of the experiment. The deficiencies of an unfit initial generation
cannot be wholly remedied by the subsequent optimizations. Therefore all
experiments were repeated at least ten times. The curves and tables depicted
in this section represent the median results.




Fig. 4. Number of problems solved by the schedule resulting from the genetic
algorithm depending on the number of generations (horizontal axis:
generations) for populations of 10, 20, 40, and 160 individuals.



Figure 4 shows the number of problems solved after 0 to 100 generations for
10, 20, 40, and 160 individuals (numbers at the curves) in 300 seconds on a
single processor system.

Our experimental results showed only poor scalability for our actual prover
system. This was due to the very limited number of training problems.
Furthermore, many of the strategies used overlap one another (see [WL99]).
Still, all the above figures indicated that the genetic gradient approach is
extremely useful for automatically configuring a strategy parallel theorem
prover, and therefore we started the data generation for the entire problem
and strategy sets. We tested each of the 110 strategies on all 4004 problems
of the latest version of the TPTP problem library [SSY94]. These test runs
were conducted on the workstation cluster as described in Section 3 with a
time limit of 300 seconds per problem and strategy and took about two months;
the results of these tests can be seen in the second and third columns of
Table 1. Using the data from these runs we generated a set of pseudo-optimal
schedules for the different problem classes. After having configured e-SETHEO
with these schedules, we tested e-SETHEO on the TPTP problems again; the
number of problems solved by the strategy parallel prover is shown in the
fourth column of Table 1, with the fifth column giving the number of
strategies contained in the respective schedules.



Problem Class             # problems   # solved by     # solved by     # solved by   # strategies
                          in class     some strategy   best strategy   schedule      in schedule

Groundable                     753          685             410             682            4
Unit Equality                  446          365             330             341            5
Pure Equality                  132           96              85              93            6
Horn w. Equality               226          183             175             183            3
Horn w/o Equality              373          293             275             291            3
non-Horn w/o Eq. (large)       268           98              74              96            6
non-Horn w/o Eq. (small)       266          227             191             227            6
non-Horn w. Eq. (large)        841          140              78             132            9
non-Horn w. Eq. (small)        699          445             317             416           12
TOTAL                         4004         2532            2201            2461            -

Table 1. Results in number of proofs found for TPTP v2.2.0 (1 processor, 300
seconds)
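How close the schedules come to the upper bound given by "solved by some
strategy" can be checked with a few lines of Python. The figures are copied
from Table 1; the variable and function names (TABLE_1, coverage, totals) are
introduced only for this sketch.

```python
# (problems in class, solved by some strategy, solved by schedule), Table 1.
TABLE_1 = {
    "Groundable": (753, 685, 682),
    "Unit Equality": (446, 365, 341),
    "Pure Equality": (132, 96, 93),
    "Horn w. Equality": (226, 183, 183),
    "Horn w/o Equality": (373, 293, 291),
    "non-Horn w/o Eq. (large)": (268, 98, 96),
    "non-Horn w/o Eq. (small)": (266, 227, 227),
    "non-Horn w. Eq. (large)": (841, 140, 132),
    "non-Horn w. Eq. (small)": (699, 445, 416),
}


def coverage(row):
    """Fraction of the solvable problems that the schedule actually solves."""
    _problems, solved_by_some, solved_by_schedule = row
    return solved_by_schedule / solved_by_some


def totals(table):
    """Column sums over all problem classes."""
    return tuple(sum(column) for column in zip(*table.values()))


for name, row in TABLE_1.items():
    print(f"{name:26s} {coverage(row):6.1%}")
print(f"{'TOTAL':26s} {coverage(totals(TABLE_1)):6.1%}")
```

The column sums reproduce the TOTAL row for the three columns used here (4004,
2532 and 2461), and the overall ratio 2461/2532 is roughly 97 percent.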



The experiments in this section are explained in greater detail in [SW99] and
[SW99a].

5 Test Set Extraction 

The applicability of the system described in the previous sections strongly
depends on the amount of time required to configure the prover. These time
requirements are dominated by the data generation process and, to a lesser
degree, by the genetic-gradient procedure used to configure the prover system.
In an application environment, it is necessary to minimize the data generation
and training period as much as possible while maintaining a sufficient overall
performance of the final system. There are several ways in which such a
minimization can be done. One might reduce the time allowed for each prover
run during data generation. But this method is not scalable beyond a certain
minimum amount of time necessary to obtain meaningful results, and all
problems would still have to be evaluated, even for very large domains.
Additionally, the number of prover runs necessary for the data generation
phase is linear in the number of problems, while the genetic gradient
procedure is cubic in the number of test problem results to be evaluated.
Therefore we chose instead to reduce the number of problems from the problem
domain that are used for the data generation process. This requires a method
for extracting a representative subset of the problem domain such that the
resulting prover configuration works sufficiently well on the entire problem
domain.

We decided to try a simple randomized approach to this problem: Using the
given categorization of problems into nine problem classes as described in
Section 3, we determined the sets of problems belonging to each class. From
these, we randomly selected subsets of varying sizes and used these subsets
for configuring our prover system in the way specified in Section 3.

We tested the validity of this approach by doing the following for each prob- 
lem class: First we extracted random subsets consisting of 10, 20, 30, 50 and 75 
percent of the problems in the problem class that could be solved by at least one 
strategy. In order to cross-validate the results, we repeated this step 10 times, 
obtaining a total of 50 random subsets. Then we generated and optimized one 
schedule for each of these random subsets and finally tested the performance of 
all schedules on the entire respective problem class. These numbers are compared 
with our optimized schedules based on the entire TPTP library. The results of 
these experiments are depicted in Figure 5. For each problem class one diagram is
given containing three curves. The upper dashed curve indicates the performance 
of the best of the 10 schedules, the lower dashed curve shows the performance of 
the worst of the 10 schedules and the solid curve gives the performance figures 
for the median element in each schedule set. The values for 100 percent of the 
test set are given by the schedules computed from the entire problem classes and 
used in e-SETHEO. 
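The sampling step of this experiment can be sketched as follows. The sketch
assumes that the problems of one class that are solvable by at least one
strategy are given as a collection of names; the function name draw_subsets,
the seed parameter and the rounding of subset sizes (integer division) are
choices made for this illustration and are not taken from the experiment.

```python
import random
from typing import Dict, Iterable, List

PERCENTAGES = (10, 20, 30, 50, 75)
REPETITIONS = 10


def draw_subsets(solvable_problems: Iterable[str],
                 percentages=PERCENTAGES,
                 repetitions: int = REPETITIONS,
                 seed: int = 0) -> Dict[int, List[List[str]]]:
    """Draw `repetitions` random training subsets for every percentage."""
    rng = random.Random(seed)        # fixed seed keeps the draw reproducible
    pool = sorted(solvable_problems)
    subsets = {}
    for percent in percentages:
        size = max(1, len(pool) * percent // 100)
        subsets[percent] = [rng.sample(pool, size) for _ in range(repetitions)]
    return subsets
```

With five percentages and ten repetitions this yields the 50 random subsets
per problem class mentioned above; one schedule is then trained on each subset
and evaluated on the entire class.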

The “dents” of non-monotonicity showing in some of the curves have their
origin in the random selection of the test set. If by accident a highly
non-representative test set is produced, this cannot be corrected entirely by
the ensuing schedule optimization. It becomes apparent from the data displayed
in Figure 5 that the problem classes differ in the homogeneity of their
problems with respect to search space behaviour. While for the class of
groundable problems a test set of 10 percent is enough to obtain 98 percent of
the performance of the schedule based on the full problem domain (with the
best of the 10 schedules), for the classes of non-Horn problems with equality
the test set has to comprise 30 percent of the problem domain to achieve those
98 percent. These figures reflect our experience that it is easier to find
well performing strategies for the class of groundable problems than for the
classes of non-Horn problems with equality, a fact underpinned by the
respective numbers of strategies given for the schedules in Table 1.
Nevertheless, even our fairly straightforward approach can achieve 98 percent
of the performance with only 30 percent of the test data.

Fig. 5. The performance of schedules trained on subsets of the respective
problem classes.

6 Assessment and Future Work 

The performance of automated theorem provers can be improved by the
introduction of strategy parallelism in combination with automated resource
allocation. While in theorem proving the system developer or advanced user can
often tune the system by a suitable selection of parameters, this is not
possible if the theorem prover is to be integrated into a larger proof
environment like ILF [DGHW97] or if the prover is to be applied to a new
problem domain of unknown characteristics. In this case the configuration of
the theorem prover must be done automatically, and the use of strategy
parallelism lends itself to the solution of this problem. We have introduced a
system that, presented with a set of problems, automatically configures itself
to match the problem set and solves the given proof tasks. Nevertheless, the
resulting system still remains sufficiently universal, as has been
demonstrated by the results in [Vor00]. Following the CADE-16 ATP systems
competition, Andrei Voronkov tested the participating systems on altered
versions of the competition problems, which he had changed in ways that should
render excessive prover tuning ineffective. Although there had been no
opportunity to adapt e-SETHEO to the altered problems, our system did very
well under those circumstances, still solving the most problems. Even though
the methods used and the results presented in [Vor00] will be the subject of
further discussion, we consider them evidence in support of our paradigm; the
paper suggests that our approach generalizes reasonably well.

Our implementation of e-SETHEO has moved away from a parallel frame- 
work for SETHEO towards a generic tool for parallel theorem proving, able to 
incorporate practically any state-of-the-art theorem prover. 

In this paper we did not address a variety of issues that will be the subject 
of future research: 

The sizes of the training data sets need to be minimized. Given the large 
number of different strategies involved, this is a prerequisite for fast domain
adaptation. Another problem that remains to be solved is the categorization of
problems into problem classes. The fixed problem classes used hitherto have 
worked reasonably well in our given research context, but still a more generic 
approach to the categorization problem should be investigated as well. The cat- 
egorization of problems according to their syntax remains a reasonable idea but 
has to be supplemented by other methods. With each problem represented by 
a weighted sum over its syntactic values, machine learning techniques could be 
employed to classify each problem in a subclass of problems showing similar 
search space behaviour. Further, one might use the existing knowledge about 
the problems in a given class to make reasonable subdivisions. However, knowl- 
edge of that kind cannot be extracted by purely syntactical means and would 
require building a knowledge base of previous successful proof attempts, as
is already done in the equality component E. Finally, one last issue deserving
our attention is the refinement of the test set extraction. While the results
we obtained are promising, the random component would usually require the
configuration routine to be repeated several times to smooth out the results.
How can we determine what a representative problem from a certain domain
should look like? Such knowledge might reduce the effort necessary to
configure the prover system as well as help to minimize the required test set
size.

7 Conclusion 

The search procedures of ATP systems are not generic enough to be applied to
new problems without modification. Therefore we consider the adaptation of the
prover system to the problem domain a basic necessity. With the prover system
e-SETHEO, we have implemented for the first time a system that adapts itself
to the problem domain in a fully automated way. Even though e-SETHEO has been
tuned to perform well on the TPTP library of problems, the generic nature of
our tuning mechanism makes us optimistic that we can adapt e-SETHEO to perform
well on arbitrary sets of application problems. Given the results we obtained
with test sets of limited size, we are also very optimistic about the
scalability of our approach to large problem domains.
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Abstract. We have developed a new efficient neural network-based al- 
gorithm for Alife application in a competitive world whereby the effects 
of interactions between organisms are evaluated in a weak form by taking the
position of the nearest food elements into consideration but not the positions
of the other competing organisms. Two online learning algorithms, an
instructive ASL (adaptive supervised learning) algorithm and an evaluative
feedback-oriented RL (reinforcement learning) algorithm, have been tested in
simulated Alife environments with various neural network algorithms. Adopting
an adaptively selected best sequence of feedback action period Aa, which we
have found to be a decisive parameter in improving the network efficiency, the
ASL-guided FuzGa showed an improved performance compared with ASL-guided
CasCor and RL-guided FuzGa. We confirm that the present solution successfully
evaluates the effect of interactions at a larger Fa (food availability),
reducing to an isolated solution at a lower value of Fa.



1 Introduction 

Alife is a study of man-made systems that exhibit behaviors characteristic of 
natural living systems [9]. Observing that an individual Alife organism passes 
through a sequence of phenotypical forms growing from an initial cell (the egg) to 
the adult form, we need a constructive algorithm for simulating neural networks- 
based Alife so that its gradual constructive process observed is to be reflected 
in the system architecture. For instance, Fullmer and Miikkulainenj^’s model 
is based on a neural network architecture with marker-based genetic encoding 
scheme while Dellaert and Beer [1] used a neural development process to evolve
an organism’s neural network. However, the construction methods mentioned 
involve an evolutionary process thus requiring a considerable length of chromo- 
somes encoding and consequently considerable computational time and memory 
space. 

Another approach such as CasCor [Zj exploits a constructive neural network 
design which starts with a small network and then grows additional hidden 
neurons and weights until a satisfactory solution is found. It is widely used in 
pattern classification and function approximation. However, a difficulty
remains with these methods in determining a reasonable number of network
layers, on which a given problem critically depends.

The purpose of this paper is to develop a new learning algorithm for possible
use with constructive neural network algorithms in a competitive environment,
evaluating interactions between organisms. We develop two on-line learning
algorithms, an Adaptive Supervised Learning (ASL) algorithm and a
Reinforcement Learning (RL) algorithm, which are intended to guide the
learning mechanisms by specifying desired action patterns for each of the
organisms or by considering the whole problem of a goal-directed agent through
evaluative feedback, respectively. The ASL uses a fuzzy logic technique to
dynamically produce adaptive supervised learning that implements desired
action patterns effective for each of the individual organisms comprising the
entire group. This should be compared with the goal-directed RL, where
reinforcement is activated whenever a feedback action is followed by a
satisfactory state of affairs, reflecting an integrated effect of the
organisms’ interactions. Our task is to find the best combination of a
constructive FNN system with the appropriate learning algorithm such that we
obtain the best averaged fitness value with the least standard deviation, so
that a majority of organisms can survive in the given competitive environment.
Here we define the fitness value as the number of food elements eaten by an
organism during its lifetime.



2 Modeling Alife Forms 

In Part I [6] of this paper, we have successfully modeled an Alife organism by
neural networks in an isolated world. How does a competitive environment
differ from an isolated world for the proposed FuzGa? We examine this by
setting up a competitive environment where a population of 100 independent
organisms search for 1,000 randomly distributed food elements, each occupying
a single cell within a world of 100 x 100 lattice cells.

The organisms in the competitive world are all endowed with a rudimentary 
sensory system sensing the angle and distance to the nearest food element as 
inputs from the environment. As in Nolfi and Parisi[^, the modes of output 
actions in a single action will be restricted to the following four modes which 
are represented by binary representations: 00=go one cell forward, 10=turn 90 
degrees right and then go one cell forward, 01=turn 90 degrees left and then go 
one cell forward, 11=turn 180 degrees and then go one cell forward.

When an organism steps on a food cell, it eats the food element and this 
element will disappear. All the food elements will be periodically reintroduced 
at every 100 actions of the organisms. As in FuzGa[Sj, each organism’s nervous 
system is modeled by a fuzzy neural network (FNN) comprising three layers of
neurons. The FNN has 2 input variables labeled X_0, X_1, which give the angle
(measured clockwise from the organism’s facing direction) and the Euclidean
distance to the nearest food element (both values are scaled from 0.0 to 1.0),
and 2 output variables labeled Y_0, Y_1, which are encoded by the binary
representations above, implying that all output values from the FNN will be
rounded to either 1 or 0. We start with an initial architecture of four fuzzy
intervals {East, South, West, North} for X_0, labeled {A_{0,0}, A_{0,1},
A_{0,2}, A_{0,3}}, and two fuzzy intervals {Far, Near} for X_1, labeled
{A_{1,0}, A_{1,1}}. Figure 1 shows the initial architecture of the FNN, which
grows into a reasonably sized architecture by increasing the number of fuzzy
intervals on X_0 and X_1.

Here Π_i (i=0,1,2,...,7) is the multiplication neuron and Σ_j is the summation
neuron, while μ_{0,j}(X_0) (j=0,1,2,3) and μ_{1,j}(X_1) (j=0,1) represent the
respective degrees of membership of X_0 and X_1 in the given fuzzy intervals.
L_{ij} (i=0,1,2,...,7; j=0,1) is the connection weight between the second layer and
third layer. fx{x) = ). 

When input Xq and Xi are given, the neurons in the third layer output: 

Yj = ^J' 0 ,^-lxro{Xo) X pi,i{Xi) X where j = 0, 1; ro and 

ri are the respective number of current fuzzy intervals on Xq and Xi both of 
which increase from the initial interval 4 and 2 as FuzGa grows respectively. We 
now develop below the learning algorithms appropriate to dynamically changing 
environments due to interactions from other organisms. 

3 Online Learning in a Competitive Environment 

In a competitive environment, FuzGa-driven independent organisms must compete
for food, with their search strategy being influenced by interactions with the
other organisms. FuzGa requires a learning mechanism different from those used
in an isolated environment because, in addition to acquiring knowledge from
the individual’s experience as in the isolated environment, the learning
mechanism should be capable of taking the direct interactions between the
organisms into consideration. Our task is, in any case, to achieve the best
averaged fitness value (the number of food elements eaten by an organism in
one generation) with the least deviation among the organisms. In this section,
we present an ASL (adaptive supervised learning) and an RL (reinforcement
learning) paradigm. Both are on-line learning methods and provide a form of
feedback from the environment to evaluate an error metric used in FuzGa.

3.1 Adaptive Supervised Learning 

How are we going to take the dynamically changing distribution of food
elements as well as the organisms’ positions into account in implementing the
desired actions of each organism? Various strategies can be developed in this
regard, but as a first step towards developing the integrated learning
algorithm, we must emphasize another important aspect: a forcing term derived
from the desired actions must be stable to ensure the robustness of the
solution. To cope with rapidly changing desired action patterns, which are
caused by sudden captures of the target food elements by competing organisms,
we have decided to introduce a temporal averaging process in terms of the
feedback action period Aa. This avoids updating the connection weights of the
networks too often, which more often than not would result in an inefficient,
unstable state. Below we present the online learning algorithm that we
consider best suited to competitive situations.

Designing ASL-Guided Desired Actions. Our strategy for designing an ASL is to
exploit fuzzy logic technology for supplying a pattern of desired actions from
a dynamically changing competitive environment only at every Aa actions of an
organism. By effectively taking a temporal average of rapidly changing desired
action patterns, the adaptively selected feedback action period Aa is found to
be most effective in achieving an averaged optimal fitness value over all
organisms in a competitive environment. The FuzGa procedure we have developed
is described in detail below.

(1) . When the nearest food is located to “East” of the organism, move for- 
ward one cell toward the East, where “EasE denotes an interval of [7 /47 t, 27t) U 
[0, 1/47t) while the East the direction of 0 or 27t. In terms of facing directions, 
this breaks down to 

(a) , if the organism’s facing direction is left, the desired action is ‘11’ that 
means turning 180 degrees and then going one cell forward (see section 2 for an 
explanation of the binary representation); 

(b) . if the organism’s facing direction is right, the desired action is ‘00’ that 
means going one cell forward; 

(c) . if the organism’s facing direction is up, the desired action is ‘10’ that means 
turning 90 degrees right and then going one cell forward; 

(d) . if the organism’s facing direction is down, the desired action is ‘01’ that 
means turning 90 degrees left and then going one cell forward. 

(2). When the nearest food element is located to the “North” of the organism,
move forward one cell toward the North, where “North” = [π/4, 3π/4). Binary
representations are hereinafter omitted.

(3). When the nearest food element is located to the “West” of the organism,
move forward one cell toward the West, where “West” = [3π/4, 5π/4).

(4). When the nearest food element is located to the “South” of the organism,
move forward one cell toward the South, where “South” = [5π/4, 7π/4).
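
The four rules above can be sketched compactly in code. The following is a minimal illustration, not the authors' implementation: the function names, the sector indexing and the extension of the binary codes of rule (1) to the other three target directions (by rotational symmetry) are assumptions made here.

```python
import math

# Compass sectors indexed counter-clockwise: 0 = East, 1 = North, 2 = West, 3 = South.
SECTORS = ("East", "North", "West", "South")

# Quarter turns to the left needed to face the target -> 2-bit action code.
# '00' forward, '01' turn 90 degrees left, '11' turn 180 degrees, '10' turn 90 degrees right.
# Assumption: the codes given for an eastern target carry over to the other targets.
TURN_TO_ACTION = {0: "00", 1: "01", 2: "11", 3: "10"}


def food_sector(angle):
    """Map an angle in radians to a sector index; East covers [7pi/4, 2pi) U [0, pi/4)."""
    shifted = (angle + math.pi / 4) % (2 * math.pi)
    return int(shifted // (math.pi / 2)) % 4


def desired_action(food_angle, facing):
    """Return the desired 2-bit action for an organism facing sector index `facing`."""
    quarter_turns_left = (food_sector(food_angle) - facing) % 4
    return TURN_TO_ACTION[quarter_turns_left]
```

For a food element to the East, `desired_action(0.1, 2)` (facing left, i.e. West) yields the 180-degree turn of case (a), and facing indices 0, 1 and 3 reproduce cases (b), (c) and (d).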



Simulating FuzGa-Guided Alife by ASL Learning Algorithm. Among the
many neural networks tested, including CasCor, FlexNet, MLPs and Pruning [8],
the constructive compound neural network FuzGa [8] that we developed in Part
1 of this paper has proved to be efficient in both time complexity and network
performance when used to simulate a given group of artificial life organisms
in an isolated environment. FuzGa starts with an initial 3-layer neural network
and goes through repeated learning processes, first by a steepest descent
learning method (SDLM) of fuzzy logic and then, whenever a stagnation is hit,
by genetic evolution, so that a new fuzzy network reproduced from selected
organisms and coupled with a mutation of the network connection weights is
again subjected to SDLM. The situation changes in the competitive environment,
however: because the positions of the nearest food elements change dynamically
as they are captured by other competing organisms, the organism is forced to
change its target food element. Depending on the food element density nearby,
immediate feedback action may not be optimal. Consider, for example, a
situation with dense food elements, where the position of the nearest food
element changes rapidly but not by much. Evidently, too frequent feedback
actions that change network configurations, including their weights, are
costly and result in inefficient and unstable performance. We show that the
adaptive choice of the feedback action period Δα plays a key role in
optimizing the performance of the algorithm.

In the competitive environment of section 2, each of the 100 organisms is
implemented as a 3-layer FNN with the architecture of Fig. 1, with connection
weights chosen at random initially. All the organisms are placed at randomly
assigned positions in the 100*100 cells and compete for food elements
simultaneously. This simulation process has been implemented with Java threads
so that the 100 organisms run in parallel, each taking the following Steps 1
to 5 at each action:

Step 1. Sense the angle and distance to the nearest food element as the inputs
X0 and X1 from the competitive environment;

Step 2. Map each of the inputs X0 and X1 to a value between 0 and 1;

Step 3. Calculate the outputs Y0 and Y1 of the organism’s FNN;

Step 4. Threshold each of the outputs Y0 and Y1 to 0 or 1. This produces
one of the four FNN output patterns in the binary representation
‘00’, ‘01’, ‘10’ and ‘11’ of section 2;

Step 5. Perform the action designated by the FNN’s output and increase the
number of food elements eaten by the organism by 1 if this action makes the
organism eat one food element;
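
As an illustration of Steps 2 and 4 only, the sketch below maps the sensed inputs to [0, 1] and thresholds the outputs into a binary pattern. The normalization scheme (dividing the angle by 2π and the distance by a hypothetical `max_distance`) and the cut at 0.5 are assumptions; the text does not specify them.

```python
import math


def normalize_inputs(angle, distance, max_distance):
    """Step 2 (assumed scheme): map angle (radians) and distance to values in [0, 1]."""
    x0 = (angle % (2 * math.pi)) / (2 * math.pi)
    x1 = min(max(distance / max_distance, 0.0), 1.0)
    return x0, x1


def threshold_outputs(y0, y1, cut=0.5):
    """Step 4: threshold each output to 0 or 1 and return the 2-bit pattern."""
    return f"{int(y0 >= cut)}{int(y1 >= cut)}"
```

With these definitions, FNN outputs of 0.7 and 0.2 would be read as the pattern ‘10’.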



Implementing Feedback Action at Every Δα-th Organism’s Action.

Only at the end of every Δα actions do we take the following feedback actions
of Steps 6 to 8 to implement adaptive supervised learning:

Step 6. Obtain the desired action pattern as designated in section 3.1.1;

Step 7. Calculate the error metric between the FNN’s output and the desired
action pattern;

Step 8. Update the connection weights of the FNN by the SDLM [6], based on
the minimization of the error metric, where the learning rate is chosen as 0.50;
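
The effect of applying feedback only at every Δα-th action can be sketched as follows. This is a deliberately simplified illustration: a single sigmoid unit stands in for the 3-layer FNN, the error metric is assumed to be a squared error, and `TinyUnit` and `run_actions` are hypothetical names. Only the learning rate of 0.50 is taken from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyUnit:
    """A single sigmoid unit standing in for an organism's FNN (illustration only)."""

    def __init__(self, weights, bias=0.0, learning_rate=0.5):
        self.weights = list(weights)
        self.bias = bias
        self.learning_rate = learning_rate  # 0.50 as stated in Step 8

    def forward(self, inputs):
        z = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        return sigmoid(z)

    def feedback(self, inputs, desired):
        """Steps 7-8: one steepest-descent step on E = 0.5 * (y - desired)**2.

        Returns the error measured before the weights are updated.
        """
        y = self.forward(inputs)
        grad_z = (y - desired) * y * (1.0 - y)
        self.weights = [w - self.learning_rate * grad_z * x
                        for w, x in zip(self.weights, inputs)]
        self.bias -= self.learning_rate * grad_z
        return 0.5 * (y - desired) ** 2


def run_actions(unit, samples, delta_alpha):
    """Act on every sample, but apply feedback only after every delta_alpha-th action."""
    updates = 0
    for step, (inputs, desired) in enumerate(samples, start=1):
        unit.forward(inputs)              # Steps 1-5: sense, compute, act
        if step % delta_alpha == 0:       # Steps 6-8: periodic feedback
            unit.feedback(inputs, desired)
            updates += 1
    return updates
```

With `delta_alpha = 5`, ten actions trigger only two weight updates, which is the saving in update cost that a longer feedback period buys.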



Implementing Genetic Transformation and Optimizing a Sequence of
Feedback Action Period Δα. At the end of one generation, comprising 6,500
actions for each organism, we will in all probability have hit either a
stagnation due to local minima or a semi-stagnation point due to the
inefficiency caused by too frequent changes of network configuration. We then
undertake the following Steps 9 to 16 in search of a global minimum solution:

Step 9. Compute the number of food elements eaten by each organism in one
generation. These numbers serve as the fitness values used in the selection
process;

Step 10. Select the best 20%, i.e. the 20 best organisms, for agamic
(non-sexual) reproduction;

Step 11. Generate 5 copies of each of the 20 best organisms to
reproduce a new population of 100 organisms;

Step 12. Apply mutations to 10% of the connection weights of the new 100
organisms by adding a random real value between -1.0 and 1.0;

Step 13. If the average fitness value of the organisms hits neither a
stagnation nor a semi-stagnation, go to Step 16. Here we define the stagnation
and semi-stagnation states of the fitness values in terms of the stagnation
rate $\mathrm{STR} = \sum_{i=k-m+1}^{k} |f(i+1) - f(i)|$, where f(i) denotes
the average fitness value at the i-th generation; m = 5 is generally chosen,
implying that STR is an average over 5 consecutive generations. A stagnation
state is given by ε0 > STR, while a semi-stagnation state is given by
ε1 > STR > ε0, where 0.01 and 0.1 are chosen for ε0 and ε1 respectively.

Step 14. If the average fitness value of the organisms hits a stagnation, the 
FNNs will be grown through fuzzy logic technique (see for detail), where the 
candidate pool contains 3 sets of candidates and the number of candidates per 
set varies from 1 to 3; 

Step 15. If the average fitness value of the organisms hits a semi-stagnation,
Δα is increased by 1;

Step 16. Start the new generation with the same food distribution as that in 
the previous generations. 

Stopping condition: The above process continues for a given number of
generations (say, 200 generations).
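
The selection, reproduction, mutation and stagnation test of Steps 10 to 13 can be sketched as below. This is a hedged illustration rather than the authors' code: organisms are represented as flat weight lists, "10% of the connection weights" is approximated by mutating each weight independently with probability 0.1, the stagnation rate is taken as the sum of absolute changes over the last m generation-to-generation steps, and all function names are hypothetical.

```python
import random


def stagnation_rate(avg_fitness, m=5):
    """Sum of |f(i+1) - f(i)| over the last m generation-to-generation changes."""
    recent = avg_fitness[-(m + 1):]
    return sum(abs(b - a) for a, b in zip(recent, recent[1:]))


def classify(str_value, eps0=0.01, eps1=0.1):
    """Step 13: compare STR with the thresholds eps0 < eps1 (boundary handling assumed)."""
    if str_value < eps0:
        return "stagnation"
    if str_value < eps1:
        return "semi-stagnation"
    return "progress"


def next_population(population, fitness, rng, top_fraction=0.2, mutation_rate=0.1):
    """Steps 10-12: keep the best 20%, copy each 5 times, mutate about 10% of the weights.

    `population` is a list of flat weight lists; `fitness` holds the food counts.
    """
    n = len(population)
    n_best = max(1, int(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: fitness[i], reverse=True)[:n_best]
    copies = n // n_best
    children = []
    for i in ranked:
        for _ in range(copies):
            child = [w + rng.uniform(-1.0, 1.0) if rng.random() < mutation_rate else w
                     for w in population[i]]
            children.append(child)
    return children
```

A caller would then grow the network on "stagnation", increase Δα by 1 on "semi-stagnation", and otherwise proceed directly to the next generation.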

Step 15 effectively increases the action period, thus lengthening the temporal
averaging period of an otherwise too quickly changing pattern of desired
actions. This contributes to an improved method for finding a global minimum
solution. The semi-stagnation point is associated with direct interactions
among the competing organisms, and we try to optimize the network
configurations by adjusting the feedback action period Δα so as to minimize
the feedback error metric. The stagnation point is associated with local
minimum jams, and we try to cope with the situation firstly by genetic
transformation and then by adding more flexible neural networks for optimizing
the network performance.

3.2 Reinforcement Learning 

We consider another popular learning method, reinforcement learning (RL), which
addresses the whole problem of a goal-directed agent interacting with an
uncertain environment. RL is oriented toward evaluative feedback and evaluates
the integrated performance of the system: if actions are followed by a
satisfactory state of affairs, or by any visible improvement in the state of
affairs, reinforcement is activated. In some typical reported cases where
reinforcement learning is used as a learning paradigm involving neural
networks, genetic algorithms and artificial life [3], organisms resort to
internally generated feedback for learning and autonomously evolve a
specification of what to learn based entirely on a reinforcement signal; note,
however, that the reinforcement signal there (death or reproduction) is
heavily delayed and of relatively little use during life. In our problem of a
competitive environment, by contrast, the organism’s interaction with its
environment provides a useful reinforcement signal for use during life: an
increase in the number of food elements eaten corresponds to a reward, and no
increase to a penalty. Thus, we use a simplified version of the associative
reward-penalty algorithm, in which a food-eaten criterion is employed to
generate a two-valued reinforcement signal.



Designing RL-Guided Desired Actions. The RL algorithm implemented
for each organism (represented by an FNN) is described below.

At every Δα-th action of an organism, a two-valued reinforcement signal S
is generated: S = +1 if the organism eats at least one food element within
the Δα actions; S = -1 otherwise.

This is used to produce a desired (teaching) action pattern $Y_i^d$ (i = 0, 1)
from the last output $Y_i$ (i = 0, 1) of the FNN:

$$Y_i^d = \begin{cases} H(Y_i - \tfrac{1}{2}), & S = +1 \\ 1 - H(Y_i - \tfrac{1}{2}), & S = -1 \end{cases}$$

where H is the Heaviside step function and $Y_i$ is the output of the FNN. The
teaching action pattern then serves as the feedback from the environment to
evaluate the error metric used in the SDLM [6] after every Δα actions. The
simulation procedure of RL-guided FuzGa is the same as that of ASL-guided
FuzGa in section 3.1.2.
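
A minimal sketch of this reward-penalty rule follows. It assumes the outputs are thresholded at 1/2 and that H(0) = 1; the function names are illustrative and not taken from the paper.

```python
def heaviside(x):
    """Heaviside step function; H(0) = 1 is an assumed convention."""
    return 1 if x >= 0 else 0


def reinforcement_signal(food_eaten_in_period):
    """S = +1 if at least one food element was eaten in the period, else -1."""
    return 1 if food_eaten_in_period >= 1 else -1


def teaching_pattern(outputs, signal):
    """Keep the thresholded outputs on reward, flip them on penalty."""
    bits = [heaviside(y - 0.5) for y in outputs]
    return bits if signal == 1 else [1 - b for b in bits]
```

On a reward the last action is reinforced as the teaching pattern, while on a penalty the complementary pattern is taught, so the organism is pushed away from the action that failed.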



4 Results of Simulations 

4.1 Fitness and Deviation 

Figure 2 shows the fitness values for ASL- and RL-guided simulations by FuzGa-,
CasCor-, and non-constructive Fixed FNN-driven organisms for various feedback
action periods Δα, some of which are fixed and some chosen adaptively in
accordance with Steps 9 to 16 of 3.1.4. All of the fitness values given are
based on two different averaging operations: for a given organism, an average
is taken over the results of 20 simulations starting from 20 groups of
different randomly assigned initial weights at generation 0; the fitness
values in the figure are then averages over all 100 organisms. The
corresponding standard deviations of the fitness values are shown in Fig. 3.

[Figures 2 and 3: fitness values and standard deviations plotted against
generation (0 to 200) for the following curves]

1. ASL-guided FuzGa with adaptive Δα
2. ASL-guided CasCor with adaptive Δα
3. RL-guided FuzGa with adaptive Δα
4. ASL-guided FuzGa with Δα = 1
5. ASL-guided CasCor with Δα = 1
6. RL-guided FuzGa with Δα = 1
7. ASL-guided Fixed FNN with Δα = 1
8. RL-guided Fixed FNN with Δα = 1

Fig. 2. Fitness values

Fig. 3. Standard deviations

In the ASL-guided simulations by FuzGa- and CasCor-driven organisms,
FuzGa outperforms CasCor in fitness values (compare curves 1 and 4 with curves
2 and 5 in Fig. 2). FuzGa with an adaptive feedback action period Δα gives
more than an 8.3% improvement in fitness value over CasCor with adaptive Δα.
The improvement is also reflected in a 31.1% reduction in CPU time, from 69.4
hours for CasCor to 47.8 hours for FuzGa (Table 1). The same holds for a fixed
feedback period, where about a 10.2% improvement is seen in the fitness value
at the 200-th generation at Δα = 1 (Fig. 2). We can also observe from Fig. 3
that the deviations among FuzGa-driven organisms are smaller than those among
CasCor-driven organisms. Together, these results imply that FuzGa is more
efficient in both network performance and time complexity than the CasCor
network design algorithm. The improved performance and time complexity come
from the fuzzy logic technique, the genetic algorithm and the sufficiently
small 3-layer design of the neural networks adopted: the fuzzy logic technique
is exploited as an efficient learning algorithm to construct a reasonably
sized network, while the genetic algorithm helps design an improved network by
evolution. The relatively small number of layers facilitates the use of an
efficient steepest descent method in narrowing down the solution space of
fuzzy logic, contributing to considerable savings in time in the processing of
network learning and connection.
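
As a quick check of the CPU-time reduction quoted in this paragraph, using only the two run times given in the text:

```latex
\frac{69.4 - 47.8}{69.4} = \frac{21.6}{69.4} \approx 0.311
\quad \text{(a reduction of about 31.1\%)}
```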

As Figs. 2 and 3 show, the constructive FuzGa with an adaptive feedback action
period Δα gives the best performance, with the largest fitness and smallest
deviation values, when compared with the fixed-Δα cases. Contrary to our
expectation that Δα = 1 might be the best choice, the adaptive scheme proposed
in Step 15 of 3.1.4 gives a far better result. The role of Δα is quite
decisive in accelerating convergence to a global minimum solution. For
instance, in ASL-guided FuzGa simulations the adaptive Δα gives more than a
10% improvement in fitness value at the 200-th generation over that of Δα = 1
(see curves 1 and 4 of Fig. 2 for details). A smaller Δα is effective in the
early generations of training, where the rate of change of the fitness value
is large. This should be compared with the behavior at later generations,
where a larger Δα is employed to train the organisms: since a steady, robust
solution prevails there, the rate of change of the fitness value is small and
little further learning is needed. We thus see that a shorter temporal
averaging of rapidly changing action patterns leads to a more stable solution
at early generations, and a longer one at later generations.

Because of the added flexibility due to hidden neurons, FuzGa-driven organisms
achieve a higher level of fitness, attaining a global minimum solution more
effectively than Fixed FNN-driven organisms. We see in Fig. 2 that the Fixed
FNN has difficulty escaping from local minimum solutions.

Although not effective in ensuring high fitness values, RL provides an
extremely effective strategy for minimizing the level of deviation among
organisms, implying that a very fair competition is ensured among the
organisms. This can be explained by the goal-directed learning strategy it
adopts. RL makes no use of the nearness information of the food element, which
is naturally the most useful information for improving an individual’s
fitness. Instead, it evaluates the effectiveness of the last action taken and
forces the FNNs to be trained constantly, increasing the possibility of eating
a food element at the next action blindly, so to speak. This implies that a
failure of one organism at one time leads to its success at a later time step,
but only at the expense of the chances of other organisms.

4.2 The Number of Hidden Neurons and Optimal Sequence of Δα

As in [8], we have also found that there exists a reasonably sized architecture
for the constructive FuzGa for this problem, which is essentially reflected by
the number of hidden neurons grown during the construction (compare Fig. 4 with
Fig. 8 of [8]). Fig. 4 shows how the hidden neurons grow for ASL-guided as well
as RL-guided FuzGa. As for FuzGa in [8], we have extended the range of
simulations to 300 generations to confirm that the number of hidden neurons of
ASL-guided FuzGa converges at the value of 48, as compared with 66 for
RL-guided FuzGa.

An optimal sequence of Δα as determined by the procedure of section 3.1.4 is
shown in Fig. 5 for both ASL- and RL-guided FuzGa. We have further extended
the range of simulations to 500 generations and confirmed that Δα converges
at the value of 28 for ASL-guided FuzGa and 31 for RL-guided FuzGa.

The RL-guided FuzGa must consume many more hidden neurons, and a larger Δα at
the later generations, than the ASL-guided FuzGa and CasCor. By its nature of
evaluative feedback, the reinforcement signal in RL is less effective in
providing an efficient search strategy for a global minimum. For RL to get out
of local minima, FuzGa must insert hidden neurons and increase Δα more
frequently. The ASL-guided CasCor starts with the smallest network, with no
hidden neurons, but installs hidden neurons more frequently than the
ASL-guided FuzGa during the constructive procedure, so that the number of
hidden neurons keeps increasing.

4.3 Time Complexity 

Table 1 shows the computational time of ASL-guided FuzGa and CasCor in 200
generations for different values of the parameter Δα. The number of food
elements available in each epoch (an average of 100 actions of the organisms)
is 1000.

We see that the computational time decreases as Δα increases, since the
time-consuming adjustment of the connection weights of the FNN is carried out
less frequently. Due to its very deep cascade architecture, CasCor seems to be
easily trapped in local minima at later generations, thus requiring much more
time to escape from them.



Table 1. CPU time (hours) in 200 generations

4.4 Comparison Between the Competitive Environment and the
Isolated World Based on Food Availability

Up to this point, we have examined how the different FNN-driven organisms 
perform in a competitive Alife environment with two online learning algorithms. 
In this subsection, we set out to find out how our competitive model differs from 
an isolated model. We examine this by keeping the feedback action periods of 
Aa adaptive to ensure the best performance level of the systems. We keep the 
simulated world fixed at 1000 x 1000 lattice cells. In a competitive
environment, all the organisms compete for food elements in the same world, so
that the intensity of interactions obviously depends on the food availability
Fa, defined as follows.



$$F_a = \frac{\text{the maximum number of food elements available in the environment}}{\text{the number of organisms in the environment}}$$


We perform the simulations by changing Fa from 0.1 to 1000, with a larger Fa
characterizing a stronger effect of interactions.

Fig. 6. Value of Δα for Fa



Fig. 7. Number of hidden neurons for Fa 



Effect of Fa on Δα in a Competitive and Isolated Environment. Figure 6
shows how an adaptively chosen sequence of Δα changes with the food
availability Fa in the competitive and the isolated simulated worlds. Curves
1, 3, 5 and 7 give the simulation results in the isolated environment and
curves 2, 4, 6 and 8 those in the competitive world.

At low values of Fa, where the effect of interactions is very small, the
adaptively chosen Δα differs little between the two cases, but marked
differences appear at higher Fa at all generations. At early generations,
where the FNNs are at their learning stages, Δα remains stationary at small
Fa, so that the FNNs take considerable time (or generations) to adjust the
network configurations, including the weights. This should be contrasted with
larger Fa, where the effect of interactions is large and too frequent
modifications of network configurations result in inefficiency; the
differences between the competitive and isolated environments then become
quite distinct. At later generations, where the networks have completed their
learning stages, Δα remains stationary in both cases.



Effect of Fa on the Number of Hidden Neurons in a Competitive and
Isolated Environment. By comparing Figs. 6 and 7, we immediately see
that Δα and the number of hidden neurons behave almost identically with
respect to the parameter Fa. This is not surprising if we note that both Δα
and the number of hidden neurons are brought into the FNN architecture design
to cope with the effect of interactions, which inevitably brings in a complex
solution space. We know that a complex solution space demands a more flexible
network with a larger number of hidden neurons [8]. For example, we expect
that for a large Fa, where nonlinear interactions are stronger, the number of
hidden neurons will increase.



5 Conclusion and Future Work 

After extensive Alife simulations, we have found that, compared with other
constructive neural network systems such as CasCor, the three-layer
constructive FuzGa gives the most efficient performance in fitness values,
with relatively small deviations, in a competitive environment when guided by
the online instructive ASL learning algorithm. On the other hand, the
evaluative RL learning algorithm gives the smallest deviations at the
sacrifice of fitness values. We have demonstrated that the intensity of
interactions is governed by a dimensionless food availability number Fa and
that an adaptive feedback period Δα and the number of hidden neurons play an
extremely important role in the competitive environment. These facilitate an
improved solution in the extremely complex solution space that arises when the
effect of interactions is taken into account. We have also demonstrated that
the competitive solution reduces to the isolated solution as Fa becomes small,
so that the effect of the interactions is reduced.

The algorithm we have developed for computing the desired action patterns
seems general enough to be applicable to many dynamically changing
environments. When combined with a fuzzy logic technology exploiting the
efficient SDLM to minimize the resulting error metrics, the method provides a
most effective means of supplying fully supervised information, which greatly
benefits many supervised learning algorithms such as the Back-Propagation
algorithm.

The current simulation model is based on the simplest heuristic of capturing
the nearest food element, which may be effective under normal conditions.
Often, however, the positions of the other competing organisms relative to the
nearest food element may suggest a different strategy. For example, it may be
wise to avoid the nearest target if it is surrounded by too many organisms. We
want to consider a more advanced strategy that takes into account other
organisms’ positions, or even a communication scheme between the organisms.
For example, real environments inevitably involve individual differences among
organisms in predation, so that means of communication and other social
behaviors may play a role in planning an optimal strategy.


Abstract. Though impressive classification accuracy is often obtained
via discrimination-based learning techniques such as Multi-Layer Per- 
ceptrons (DMLP), these techniques often assume that the underlying 
training sets are optimally balanced (in terms of the number of positive 
and negative examples). Unfortunately, this is not always the case. In this 
paper, we look at a recognition-based approach whose accuracy in such 
environments is superior to that obtained via more conventional mech- 
anisms. At the heart of the new technique is a modified auto-encoder 
that allows for the incorporation of a recognition component into the 
conventional MLP mechanism. In short, rather than being associated 
with an output value of “1”, positive examples are fully reconstructed
at the network output layer while negative examples, rather than being
associated with an output value of “0”, have their inverse derived at
the output layer. The result is an auto-encoder able to recognize posi- 
tive examples while discriminating against negative ones by virtue of the 
fact that negative cases generate larger reconstruction errors. A simple 
technique is employed to exaggerate the impact of training with these 
negative examples so that reconstruction errors can be more reliably es- 
tablished. Preliminary testing on both seismic and sonar data sets has 
demonstrated that the new method produces lower error rates than stan- 
dard connectionist systems in imbalanced settings. Our approach thus 
suggests a simple and more robust alternative to commonly used classi- 
fication mechanisms. 



1 Introduction 

Concept learning tasks represent a form of supervised learning in which the goal 
is to determine whether or not an instance belongs to a given class. As would 
be expected, the greater the number of training examples, the more reliable the 
results obtained during the training phase. In addition, however, we must also 
acknowledge that the success of supervised learning algorithms is at least partly 
determined by the balance of positive and negative training cases. For training 
purposes, then, we would consider a data set optimal if, in addition to a certain 
minimal size, its instances were split more or less evenly between positive and 
negative examples of the concept in question. This type of division would ensure 
that our learning algorithms are not unduly skewed in favour of one case or the 
other. 

Unfortunately, such optimality is often hard to guarantee in practice. In many
domains, it is neither possible nor feasible to obtain equal numbers of positive
and negative instances. For example, the analysis of seismic data in terms of
its association with either naturally occurring geological activity or man-made
nuclear devices is hampered by the fact that examples of the latter are extremely
uncommon (and rigidly controlled). Seismic applications are not the only ones
suffering from imbalanced conditions. The problem was also documented in
applications such as the detection of oil spills in satellite radar images
[Kubat et al., 1998], the detection of faulty helicopter gearboxes [Japkowicz
et al., 1995] and the detection of fraudulent telephone calls [Fawcett &
Provost, 1997]. Thus, supervised learning algorithms used in such environments
must be amenable to these inherent restrictions.

In practice, many algorithms do not perform well when the training set is 
imbalanced (see [Kubat et al. 1998] for an illustration of this effect). Since a 
significant number of real-world domains can be described in this manner, it 
seems logical to pursue mechanisms whose performance suffers less drastically 
when counter examples are relatively hard to come by. In this paper, we present 
preliminary results obtained via the use of a Connectionist Novelty Detection 
method known as auto-encoder-based classification. Essentially, auto-encoder- 
based classifiers learn how to recognize positive instances of a concept by iden- 
tifying their common patterns. When later presented with novel instances, the 
auto-encoder is able to recognize cases whose characteristics are in some way 
similar to its positive training examples. Negative instances, on the other hand, 
generally have little in common with the training input and are therefore not 
associated with the concept under investigation. 

Though the auto-encoder as just described has been successful within a num- 
ber of domains, it has become clear that not all environments are equally recep- 
tive to a training phase completely devoid of counter examples. More specifically, 
auto-encoders tend not to be as effective when negative instances of the concept 
exist as a subset of the larger positive set. In such cases, the network is likely 
to confuse counter examples with the original training cases since it has had 
no opportunity to learn those patterns which can serve to delineate the two. 
Consequently, the method presented here will incorporate a local discrimination 
phase within the general recognition-based framework. The result is a network 
that can successfully classify mixed instances of the concept, despite having been 
given a decidedly imbalanced training set. 



2 Previous Work 

Although the imbalanced data set problem is starting to attract the attention of
a number of researchers, attempts at addressing it have remained uncoordinated.
Nevertheless, these research efforts can be organized into four categories:

— Methods in which the class represented by a small data set gets over-sampled
so as to match the size of the opposing class.

— Methods in which the class represented by the large data set can be
down-sized so as to match the size of the other class.

— Methods that internally bias the discrimination-based process so as to
compensate for the class imbalance.

— Methods that ignore (or make little use of) one of the two classes
altogether.

The first method was used by [Ling & Li, 1998]. It simply consists of
augmenting the small data set by re-sampling its instances multiple times.
Other related schemes could diversify the augmented class by injecting some
noise into the repeated patterns. The second method was investigated in [Kubat
& Matwin, 1997] and consists of removing instances from the well-represented
class until it matches the size of the smaller class. The challenge of this
approach is to remove only instances that do not provide essential information
to the classification process. The third approach was studied by [Pazzani et
al., 1994], who assign different weights to examples of the different classes,
[Fawcett & Provost, 1997], who remove rules likely to over-fit the imbalanced
data set, and [Ezawa et al., 1996], who bias the classifier in favour of
certain attribute relationships. Finally, the fourth method was studied in its
extreme form (i.e., in a form that completely ignores one of the classes
during the concept-learning phase) by [Japkowicz et al., 1995]. This method
consisted of using a recognition-based rather than a discrimination-based
inductive scheme. Less extreme implementations were studied by [Riddle et al.,
1994] and [Kubat et al., 1998], who also employ a recognition-based approach
but use some counter-examples to bias the recognition process. Our current
study investigates a technique that falls along the line of the work of
[Riddle et al., 1994] and [Kubat et al., 1998] and extends the auto-encoder
approach of [Japkowicz et al., 1995] by allowing it to consider counter
examples. The method, however, differs from [Riddle et al., 1994] and [Kubat
et al., 1998] in its use of the connectionist rather than rule-based paradigm.
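
To make the first two categories concrete, the sketch below shows the simplest possible forms of over-sampling and down-sizing. It is an illustration only and assumes purely random selection; the cited works use more careful schemes (for example, choosing which majority instances to remove), and the function names are hypothetical.

```python
import random


def oversample(minority, target_size, rng):
    """Grow the small class to target_size by re-sampling its instances with replacement."""
    extra = [rng.choice(minority) for _ in range(max(0, target_size - len(minority)))]
    return list(minority) + extra


def downsize(majority, target_size, rng):
    """Shrink the large class to target_size by keeping a random subset."""
    return rng.sample(list(majority), min(target_size, len(majority)))
```

Either function can be used to balance a training set before it is handed to a conventional discrimination-based learner.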

Our method is also related to previous work in the connectionist community.
In the past, auto-encoders have typically been used for data compression [e.g.,
Cottrell et al., 1987]. Nevertheless, their use in classification tasks has
recently been investigated by [Japkowicz et al., 1995], [Schwenk & Milgram,
1995], [Gluck & Myers, 1993] and [Stainvas et al., 1999]. [Japkowicz et al.,
1995] and [Schwenk & Milgram, 1995] use it in similar ways. As mentioned
previously, [Japkowicz et al., 1995] use the auto-encoder to recognize data of
one class and reject data of the other class. [Schwenk & Milgram, 1995], on
the other hand, use it on multi-class problems by training one auto-encoder
per class and assigning a test example to the class corresponding to the
auto-encoder which recognized it best. Both [Gluck & Myers, 1993] and
[Stainvas et al., 1999] use the auto-encoder in conjunction with a regular
discrimination-based network. They let their multi-task learner simultaneously
learn a clustering of the full training set (including conceptual and
counter-conceptual data) and discriminate between the two classes. The
discrimination step acts as both a labelling step (in which the clusters
uncovered by the auto-encoder get labelled as conceptual or not) and a
fine-tuning step (in which the class information helps refine the auto-encoder
clustering). Our method is similar to [Gluck & Myers, 1993] and [Stainvas et
al., 1999] in that it too uses class information about the two classes and
lets this information act both as a labelling and a fine-tuning step. However,
it differs in that the auto-encoder is used in a different way for each class.

3 Implementation 

Auto-encoder. As stated, an auto-encoder learns by determining the patterns 
common to a set of positive examples (in the standard case). It then uses this 
information to generalize to examples it has not seen before. In terms of the 
training itself, the key component is a supervised learning phase in which input
samples (in the form of a multi-featured vector) are associated with an
appropriate target vector. The target, in fact, is simply a duplicate of the input
itself. In other words, the network is trained so as to reproduce the input at
the output layer. This, of course, stands in contrast to the conventional neural
network concept learner, which is trained to associate positive instances with a
target value of "1" and negative examples with a "0". The architectures of the
auto-encoder (with 6 input/output units and 3 hidden units) and of the
conventional neural network (with 6 input units, 3 hidden units and 1 output
unit) are illustrated in Figure 1.



Fig. 1. Examples of Feedforward Neural Networks: (a) discrimination-based
DMLP; (b) recognition-based RMLP. Each panel shows an input layer, a hidden
layer and an output layer.



Once an auto-encoder has been trained, it is necessary to provide a means 
by which new examples can be accurately classified. Since we no longer have a 
simple binary output upon which to make the prediction, we must turn to what 
is called the "reconstruction error". The reconstruction error is defined as

$$\sum_{i=1}^{k} \bigl[ I(i) - O(i) \bigr]^2 \qquad (1)$$

where I(i) and O(i) are the corresponding input and output nodes at position
i and k represents the number of features in the vector. In other words, we 
ascertain the degree to which the output vector — as determined by the trained 
network — matches the individual values of the original input. Of course, in order 
to apply the reconstruction error concept, we must establish some constraint on 
the allowable error for instances that will be deemed to be positive examples of 
the given concept. To do so, we include a threshold determination component in 
the training phase. Since we would like the threshold T to be closely associated 
with the mean of the full set R of individual reconstruction errors, we define the 
threshold as follows 

$$T = \mathrm{mean}(R) + Z \cdot \mathrm{std}(R) \qquad (2)$$

We use the Z parameter as a means of controlling the range of acceptable recon- 
struction error values. In essence, it represents the desired confidence interval 
for the mean reconstruction error and, as such, corresponds to the standard Z 
value commonly used in statistical analysis. By combining the Z value with the 
mean and standard deviation of the error distribution, we may efficiently tune 
our boundary to suit the input at hand. 
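
The reconstruction error and the threshold of Eq. (2) can be sketched in a few lines of Python. This is only an illustration, not the MATLAB code used in the study: it assumes that the reconstruction error is the sum of squared differences between input and output components and that std(R) is the population standard deviation, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def reconstruction_error(inputs, outputs):
    """Sum of squared differences between input and output components (assumed form)."""
    return sum((i - o) ** 2 for i, o in zip(inputs, outputs))


def threshold(errors, z):
    """T = mean(R) + Z * std(R), with R the training reconstruction errors."""
    return mean(errors) + z * pstdev(errors)


def is_concept(inputs, outputs, t):
    """An example is accepted as positive when its error does not exceed T."""
    return reconstruction_error(inputs, outputs) <= t
```

A larger Z widens the band of accepted reconstruction errors, while a smaller Z tightens it around the mean.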

Extended auto-encoder. In the current study the auto-encoder has been extended 
to allow for what we call local discrimination. In other words, a small number 
of negative examples are included in the training set so that the network has 
the opportunity to determine those features that differentiate clusters of nega- 
tive instances from the larger set of positive instances. In cases where there is 
considerable overlap between concept and non-concept instances, it is expected 
that this additional step may significantly lower classification error. The exten- 
sion to the new model is relatively straight-forward. As before, target values for 
positive input are represented as a duplication of the input vectors. In contrast,
however, the target vector for negative instances is constructed as an inversion of
the input. For example, the three-tuple ⟨0.5, 0.6, 0.2⟩ would become ⟨-0.5, -0.6,
-0.2⟩ at the output layer. During the threshold determination phase, the network
output vectors for these negative examples are assessed relative to the vectors
that would have been expected had the input actually been a positive example.
In other words, the negative reconstruction error is the "distance" between the
inverted output and the original input.

Armed with this new information, we are now able to establish both positive 
and negative reconstruction error ranges. A definitive classification boundary 
is determined by finding the specific point that offers the minimal amount of 
overlap. Though this might at first appear to be a trivial task, in practice it 
is somewhat more complicated than expected. Typically, due to the underlying 
data imbalance, the range of positive reconstruction errors is much more tightly 
defined than the range of negative errors (i.e., more compactly clustered around 
the mean). For this reason, it is necessary to skew the boundary towards the mean 
of the positive reconstruction error. In our study, this extra step was not required 
since we employed a "target shifting" technique (see below) that significantly
reduced the likelihood of boundary overlap. As a result, we were simply able
to utilize the positive reconstruction boundary as described in the preceding
section. Once the final boundary condition has been established, classification of 
novel examples is relatively simple. Instances are passed to the trained network 
and reconstruction errors calculated. Errors below the boundary are associated 
with positive examples of the concept, while those above signify non-concept 
input. 
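
The boundary search and the final classification rule can be illustrated as follows. The search strategy shown here (trying the midpoints between consecutive observed errors and keeping the one that misclassifies the fewest training errors) is an assumption made for the sketch; the text does not specify how the point of minimal overlap was located, and the function names are hypothetical.

```python
def best_boundary(pos_errors, neg_errors):
    """Return the candidate boundary with the fewest training errors on the wrong side."""
    values = sorted(set(pos_errors) | set(neg_errors))
    # Candidate boundaries: midpoints between consecutive distinct error values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values

    def overlap(boundary):
        # Positives above the boundary plus negatives at or below it.
        return (sum(e > boundary for e in pos_errors)
                + sum(e <= boundary for e in neg_errors))

    return min(candidates, key=overlap)


def classify(error, boundary):
    """Errors below the boundary signal the concept; the rest signal non-concept input."""
    return "concept" if error < boundary else "non-concept"
```

When the positive errors are tightly clustered and the negative errors are spread out, ties between candidates are resolved in favour of the smallest boundary, which is one simple way of leaning towards the positive mean.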

Target Shifting. In theory, the local discrimination technique as just described 
should produce distinctive error patterns for both positive and negative training 
input. Unfortunately, initial testing using this basic scheme was quite disappoint- 
ing. In particular, it proved almost impossible to produce non-overlapping error 
ranges. An analysis of the raw network output showed that the auto-encoder 
was indeed inverting the negative training cases. However, it was clear that the
negative reconstruction error was simply much smaller than expected. The problem
was two-fold. First, the inclusion of empty or zeroed feature values effectively 
reduced the number of vector elements that could contribute to the reconstruc- 
tion error. For example, if a domain provides twenty distinctive features for each 
instance, but individual cases rarely have more than five or six non-zero feature 
values, then the ability of the network to produce distinguishable error ranges 
is severely curtailed. Second, and perhaps more importantly, the normalization 
of input prior to training can have a deleterious effect upon threshold determination.
In particular, the existence of outliers in the original input set has a
tendency to squash many feature values down towards zero. If these near-zero
features belong to negative training examples, then the inverted features will 
be deceptively close to the non-inverted input. For example, a normalized input 
value of 0.001 would become -0.001 in a perfectly trained network. Consequently, 
it is likely that the reconstruction error for many negative training cases will be 
no greater than their positive counterparts. 

To combat this problem, it was necessary to utilize some mechanism that 
could exaggerate the error associated with negative examples while leaving the 
positive reconstruction error unchanged. We chose to employ a simple technique 
by which the entire normalized range of input values was shifted in order to 
create target output. Positive instances are modified simply by incrementing 
each element of the input vector by one. Negative input is also incremented 
but, in this case, the sign of each element is also inverted. For example, the
positive input vector ⟨0.2, 0.3, 0.4⟩ becomes ⟨1.2, 1.3, 1.4⟩ while the negative
vector ⟨0.2, 0.5, 0.9⟩ becomes ⟨-1.2, -1.5, -1.9⟩. This approach to target vector
generation has two fundamental advantages. First, it maintains the normalized
input patterns so that features with large absolute values do not dominate the
training phase. Second, and more significantly in the current context, we can
ensure that properly recognized negative instances will result in the generation
of significantly exaggerated reconstruction errors. Typically, negative instances
produce values greater than two for each component of the vector while positive
instances contribute errors of less than one per component. (Note: We say
"typically" since the network is unlikely to perfectly transform all features into
the expected ranges.) In the initial implementation, only non-zero vector elements
were actually inverted, the belief being that transforming these "empty"
features might hamper the network's ability to properly recognize the original
input. However, in practice, the primary result of not shifting the zero values was
to distribute the target output more evenly and, in the process, to bring positive and
negative boundaries much closer together. As a result, all values were inverted
in subsequent experiments.
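
Target shifting as described above can be sketched as follows. The helper names are hypothetical, and the second function is only an illustration of why the shift helps: it measures, per component, how far the negative target lies from the target the same input would have received as a positive example.

```python
def shifted_target(x, positive):
    """Training target for a normalized input vector x (values in the 0-1 range)."""
    shifted = [v + 1.0 for v in x]            # increment every element by one
    return shifted if positive else [-v for v in shifted]  # negatives are also inverted


def target_gap(x):
    """Per-component distance between the positive and the negative target of x."""
    pos = shifted_target(x, True)
    neg = shifted_target(x, False)
    return [abs(p - n) for p, n in zip(pos, neg)]
```

Without the shift, a feature of 0.001 would be inverted to -0.001, a gap of only 0.002; with the shift, every component contributes a gap of at least two, including features whose value is exactly zero.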

Target shifting has proven to be a simple but effective technique for es- 
tablishing appropriate reconstruction error constraints. As should be obvious, 
domains exhibiting a higher number of features generally produce more strik- 
ing differences between positive and negative boundaries. Nevertheless, even for 
feature-poor domains, it is generally quite easy to define the appropriate ranges. 



4 Seismic Data 

Description. We applied our technique to the problem of learning how to discriminate
between seismograms representing earthquakes and seismograms representing
nuclear explosions. The database contains data from the Little Skull
Mountain earthquake of 6/29/92 and its largest aftershocks, as well as nuclear
explosions that took place between 1978 and 1992 at a nuclear testing site near the
Lawrence Livermore Labs. The long-range motivation for this application is to
create reliable tools for the automatic detection of nuclear explosions throughout
the world¹ in an attempt to monitor the Comprehensive Test Ban Treaty. This
discrimination problem is extremely complex given the fact that seismograms
recorded for both types of events are closely related and thus not easily distinguishable.
In addition, due to the rarity of nuclear explosions and earthquakes
occurring under closely related general conditions (such as similar terrain) that
can actually allow for fair discrimination between the two types of events, significant
imbalances in the data sets can be expected.² Specifically, our database
contains more nuclear explosion than earthquake data since the chances of nat- 
ural seismic activity taking place near the nuclear testing ground are very slim. 
Nevertheless, there is a strong appeal in automating the discrimination task 
since, if such a computer-based procedure could reach acceptable levels of accuracy,
it would be more time-efficient, less prone to human error, and less biased
than the current human-based approaches.

The seismic data set used in the study is made up of 49 samples, 31 representing
nuclear explosions and 18 representing naturally occurring seismic activity.
Each event is represented by 6 signals which correspond to the broadband (or
long period) components BB Z, BB N and BB E and the high frequency (or
short period) components HF Z, HF N and HF E. Z, N, and E correspond to
the Vertical, North and East components that refer to projections of the seismograms
onto the Vertical, North and East directions, respectively. All the signals
were recorded in the same locale although the earthquake data is divided into
three different classes of events that took place at different locations within that
area and at different points in time. Because the broadband recordings were not
specific enough, we worked instead with the short period records.

¹ Relevant seismic data can be recorded in a station located thousands of kilometers
away from the site of the event.

² In the more useful setting where seismograms can be transformed so that the surrounding
conditions do not need to be constant, large imbalances in the data sets
would remain, though this time they would be caused by the scarcity of nuclear
explosions and the ubiquity of earthquakes of various types around the world.

The signal onset was selected manually by inspection of each signal. Since the 
exact onset could not always be determined, the starting point was uniformly 
chosen so as to be slightly past the actual onset for all signals. Clipping was 
done after 4096 samples. This number was chosen because it includes the most
informative part of the signals and it is also a power of 2. (Because of the second 
feature, application of the Fast Fourier transform using the MATLAB Statistical 
package is faster.) Although the overall files did contain spikes, the parts of the 
signals selected for this study were not spiky, so no additional procedure needed 
to be applied in order to deal with this issue. The signals were then de-trended, 
transforming them so that the collective set exhibited a zero mean. Next, the 
signals were normalized between 0 and 1 in order to make them suitable for 
classification. Finally, the signal representation was changed by converting the 
time-series representation to a frequency representation using MATLAB's Fast
Fourier transform procedure. Although some of the earthquake files seemed to 
contain several events, we only kept the first one of these events in each case. 
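
The preprocessing chain (clipping, de-trending, normalization and the Fourier transform) can be illustrated with the following sketch. It is not the MATLAB procedure used in the study: the transform below is a textbook recursive radix-2 FFT, included to show why a power-of-two length such as 4096 is convenient, de-trending is simplified to subtracting the mean of each individual signal, and taking the magnitude of the spectrum is an assumption. All function names are hypothetical.

```python
import cmath


def detrend(signal):
    """Subtract the mean so that the signal has zero mean (simplified de-trending)."""
    m = sum(signal) / len(signal)
    return [s - m for s in signal]


def normalize(signal):
    """Rescale the signal linearly into the 0-1 range."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0] * len(signal)
    return [(s - lo) / (hi - lo) for s in signal]


def fft(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def preprocess(signal, length=4096):
    """Clip, de-trend, normalize, then return the magnitude spectrum."""
    clipped = signal[:length]
    return [abs(c) for c in fft(normalize(detrend(clipped)))]
```

Because each recursion level halves the signal, a length of 4096 needs only twelve levels, which is the efficiency argument behind choosing a power of two.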

Experimental Details. All training and testing was conducted within MATLAB’s 
Neural Network Toolbox. As mentioned, data was normalized into a 0-1 range 
(before target shifting) and features made up of zero values across all input vec- 
tors were removed. Network training was performed using the Resilient Backprop 
(Rprop) algorithm which offered extremely rapid training times in our study (for 
a more complete description of Rprop, see [Riedmiller & Braun, 1993]). 

To assess the impact of the auto-encoding with local discrimination, we chose 
a pair of comparative tests. In the first case, each of three relevant network
architectures, DMLP (conventional discrimination-based MLP), RMLP (basic
recognition-based auto-encoder), and XRMLP (auto-encoder with local discrimination),
was trained on a "set" number of seismic records (RMLP, of course,
relied only upon positive training instances). We must note, here, that the rela- 
tively small size of the data sample made network training and testing somewhat 
difficult. Splitting the input set into two equal subsets for training/testing and 
cross-validation would have left too few cases in each of the partitions; results 
would likely have been too inconsistent to have been of much value. Instead, net- 
work parameters were established by using the entire set as a training/testing 
set. Two thirds of the positive and negative cases were chosen at random and 
were used for training, while the remaining third went into a test set. Hidden 
unit counts of 16 for DMLP and 64 for both auto-encoders were established 
in this manner (Rprop does not use a learning rate or momentum constant). 
The data set was then re-divided into five folds and, using the hidden unit parameters
established in the previous step, three separate test cycles of 5-fold
cross-validation were performed. On the full (i.e., relatively balanced) data set,
the XRMLP network produced a cross-validated error of 0.161, while the DMLP
and RMLP generated 0.212 and 0.387 respectively.
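
A five-fold partition of the 49 seismic records could be produced along the following lines. The sketch assumes that the folds were stratified by class, which the text does not state, and the function name and label strings are hypothetical.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 31 explosions and 18 earthquakes, as in the seismic data set.
labels = ["explosion"] * 31 + ["earthquake"] * 18
folds = stratified_folds(labels)
```

Each fold then serves once as the test set while the remaining four are used for training, and the error is averaged over the five runs.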

In the second - and more significant - phase of testing, our goal was to directly 
compare the impact of reducing the number of negative training units upon the 
error rate of both DMLP and XRMLP. In this phase, we used from 1 to 10 
negative training samples and performed 5 network training cycles at each level. 
(Here, final testing was performed on the full data set since there were simply not
enough negative examples to use the previous 5-fold cross-validation technique.)
Figure 2 is a graphical representation of the results, while Table 1 lists both the
mean and standard deviation for each of the separate tests.



Fig. 2. DMLP vs. XRMLP: variation on negative cases (seismic).

There are two points of interest with respect to Table 1. First, the auto-encoder
provides a lower error rate when small numbers of negative samples are
used; only at ten instances does the DMLP show improved accuracy. Second,
even though the mean error rate of the DMLP diminishes as the number of 
negative samples increases, its standard deviation is much higher than that of 
the auto-encoder in the last two recordings (i.e., 5 and 10 cases). The implication, 
of course, is that DMLP is much more dependent upon the specific set of negative 
training samples with which it is supplied. 



Table 1. Negative sample variation (error rate: mean ± standard deviation)

Negative Samples    DMLP           XRMLP
1                   .421 ± 0.0     .378 ± .068
2                   .336 ± .029    .294 ± .047
5                   .199 ± .086    .178 ± .029
10                  .147 ± .125    .188 ± .047


5 Sonar Data

Description. The sonar detection problem takes as input the signals returned 
by a sonar system in those cases where mines and rocks were used as targets. 
Transmitted sonar signals take the form of a frequency-modulated chirp, rising 
in frequency. In the current context, signals were obtained from a variety of 
different aspect angles. Each instance of the data is represented as a 60-element
vector spanning 90 degrees for the mine and 180 degrees for the rock. Samples
are represented as a set of 60 numbers in the range 0.0 to 1.0, where each
number represents the energy within a particular frequency band, integrated 
over a certain period of time. The data itself was obtained from the U.C. Irvine 
Repository of Machine Learning. In total, there were 111 mine samples (positive 
class) and 97 rock samples (negative class). 

Experimental Details. As before, experimental results were obtained using MATLAB's
Neural Network Toolbox. Though Rprop again provided good performance for
the DMLP network, it was not as effective in the XRMLP environment. More 
specifically, even while using the target shifting technique, we found significant 
overlap between the positive and negative reconstruction boundaries, so much so 
that Rprop results proved unreliable at best. Experimentation with a variety of
other training functions³ eventually demonstrated that One Step Secant (OSS)
was most appropriate for this particular data set. (Note: DMLP accuracy did not 
improve with any of these other training functions.) What was most interesting 
about OSS was the definitive nature of its classification decisions (for further 
information regarding OSS, see [Batti, 1992]). Though it did not always classify 
correctly, there was generally little question as to how to assess the output vec- 
tors; positive reconstruction errors were typically very small (i.e., less than 5) 
while negative reconstruction errors were quite large (i.e., greater than 100). 

The decisive classification of OSS, however, necessitated some minor changes 
in the boundary determination phase. Because of the marked difference in the 
absolute values of the individual positive and negative reconstruction errors, it 
was possible for a small number of network classification errors to grossly inflate 
the mean reconstruction error. As such, subsequent testing would be adversely 
affected in that a number of negative test cases would likely fall inside the inflated 
boundary and be classified incorrectly. Our solution, therefore, was to prune the 
original set of reconstruction errors by removing all error values that clearly 
represented classification errors. In this case the heuristic used was to exclude 
those values which were more than five times greater than the median value in 
the reconstruction set. 
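
The pruning heuristic can be sketched as follows: reconstruction errors more than five times the median are discarded before the threshold is computed. The function names are hypothetical, and the use of the population standard deviation in the threshold is an assumption.

```python
from statistics import mean, median, pstdev


def prune_errors(errors, factor=5.0):
    """Drop reconstruction errors greater than factor times the median."""
    cutoff = factor * median(errors)
    return [e for e in errors if e <= cutoff]


def pruned_threshold(errors, z, factor=5.0):
    """Threshold T = mean + Z * std, computed on the pruned error set only."""
    kept = prune_errors(errors, factor)
    return mean(kept) + z * pstdev(kept)
```

For the errors 1, 2, 3, 2 and 150, the unpruned mean is 31.6, whereas pruning removes the value 150 and brings the mean back to 2, so a single misclassified training case no longer inflates the boundary.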

In terms of the tests themselves, we chose to focus exclusively on the comparison
of DMLP and XRMLP. Data was randomly divided into a training set
of 108 samples (61 positive, 47 negative) and a cross-validation testing set of
100 samples (50 positive, 50 negative). For both networks, training produced
an optimal hidden unit count of 32, though the OSS mechanism proved to be
relatively robust at a variety of hidden unit counts. As was the case during the
seismic testing, we were interested in the impact upon classification accuracy
as the number of negative training samples decreased. Therefore, we compared
the two architectures by training the networks on a small number of negative
samples randomly selected from the available training set. Figure 3 graphically
displays the results for negative training samples in the range one to ten (using
five-fold cross-validation), while Table 2 depicts the same results in the form of
a 95% confidence interval.

³ One Step Secant, Gradient Descent backpropagation, Bayesian Regulation backpropagation,
and One-vector-at-a-time training.



Fig. 3. DMLP vs. XRMLP: variation on negative cases (sonar).



As before, the results clearly demonstrate the benefit of the XRMLP mecha- 
nism within dramatically imbalanced settings. Inside the approximately 5:1 ratio 
represented by these tests (47 positive cases versus a maximum of 10 negative 
training cases), the recognition-based approach extracted more information from 
the training samples than did the discrimination-based alternative. We should 
also note that when the two networks were trained in a balanced environment, 
DMLP performance was superior to the XRMLP (0.22 error versus 0.29), per- 
haps suggesting XRMLP accuracy does not necessarily benefit from the unlim- 
ited addition of negative training cases. Even so, however, we must recognize 
that the accuracy of XRMLP on a limited training set is relatively close to that
of DMLP on a fully balanced data set.

Table 2. Negative sample variation (error rate: 95% confidence interval)

Negative Samples    DMLP          XRMLP
1                   .50 ± .08     .40 ± .06
2                   .48 ± .03     .39 ± .14
3                   .46 ± .03     .38 ± .11
4                   .38 ± .07     .39 ± .14
5                   .35 ± .08     .30 ± .15
6                   .43 ± .07     .29 ± .18
7                   .42 ± .13     .34 ± .05
8                   .37 ± .11     .33 ± .14
9                   .39 ± .07     .37 ± .16
10                  .29 ± .09     .25 ± .09


6 Conclusions and Future Work

In this paper, we have discussed an extension to the auto-encoder model which
allows for a measure of local discrimination via a small number of negative training
examples. Comparisons with the more conventional DMLP model suggest
that not only does the new technique provide greater accuracy on imbalanced
data sets, but that its effectiveness relative to DMLP grows as the ratio of positive
to negative training cases becomes more exaggerated. In addition, we have
noted that the auto-encoder appears to be much more stable in this type of
environment, in that its error rates tend to fluctuate relatively little from one
iteration of the network to the next.

The lack of success shown by the basic auto-encoder (i.e., without negative 
training samples) also demonstrates that some form of local discrimination is 
important in certain environments. Though such an architecture has proven very 
effective in other settings, it seems clear that the underlying characteristics of 
concept and non-concept instances may sometimes be too similar to distinguish 
without prior discriminatory training. 

There are many possible extensions of this work. First, it will be important 
to assess the accuracy of XRMLP on larger data sets; doing so will allow us 
to experiment with a wide range of imbalance ratios. (Note: preliminary work 
in this regard has been promising). Second, a possible approach to the seismic 
problem that appears promising involves the use of radial basis functions. Third, 
it would be interesting to compare our method to that of [Stainvas et al., 1999],
though our technique should be implemented within an ensemble framework in 
order for the comparison to be fair. Finally, it could be useful to extend the 
auto-encoder-based technique described in this paper to multi-class learning (by 
assigning different goals for the reconstruction error of each class) and to compare 
this method to the standard multi-class neural network technique.


Abstract. The backpropagation algorithm is an iterative gradient descent
algorithm designed to train multilayer neural networks. Despite its
popularity and effectiveness, the orthogonal steps (zigzagging) near the
optimum point slow down the convergence of this algorithm. To overcome
the inefficiency of zigzagging in the conventional backpropagation
algorithm, one of the authors earlier proposed the use of a deflecting gradient
technique to improve the convergence of the backpropagation learning
algorithm. The proposed method is called the Partan backpropagation learning
algorithm [3]. The convergence time of multilayer networks has been further
improved through dynamic adaptation of their learning rates [B]. In this
paper, an extension to the dynamic parallel tangent learning algorithm is 
proposed. In the proposed algorithm, each connection has its own learn- 
ing as well as acceleration rate. These individual rates are dynamically 
adapted as the learning proceeds. Simulation studies are carried out on 
different learning problems. A faster rate of convergence is achieved for all
problems used in the simulations. 



Keywords: Artificial neural networks, Backpropagation, Gradient descent, Parallel
tangent, Dynamic parallel tangent.

1 Introduction 

Backpropagation (BP) is the most popular and widely used learning algorithm 
for multilayer feedforward neural networks. The main limitation of BP is the slow 
pace at which it learns from examples. This is due to the fact that the standard
backpropagation method uses fixed learning steps, and as a result slows down
in flat areas and starts to take orthogonal steps near the optimum point. Over
the last number of years, many new accelerating techniques have been developed
to speed up the rate of convergence in the backpropagation training algorithm.
A global error gradient adaptation technique called the parallel tangent (Partan)
training algorithm was proposed by one of the authors [3]. The proposed method
can be used to accelerate the training process in multilayer neural networks. 

In [B], we have proposed a dynamic parallel tangent learning algorithm that
further improves the speed of training multilayer neural networks. The improvement
is achieved through the dynamic adaptation of the learning rates during the
training process. A faster rate of convergence is achieved for all learning problems
used in the simulation studies of dynamic Partan.

In this paper, we present an extension to the dynamic Partan learning algo- 
rithm. In the extended dynamic Partan, the learning and the accelerating rates 
are adapted dynamically as the training proceeds. Other acceleration techniques 
can also be incorporated in this method to further improve the rate of conver- 
gence. 

The outline of the paper is as follows. The concept of gradient descent and
parallel tangent gradient is briefly reviewed in the following section. In Section
3, the Partan backpropagation learning algorithm is explained. Dynamic Partan,
extended dynamic Partan, and rate adaptation strategies are presented in Section
4. Subsequently, the results of the simulation studies are summarized in Section
5. Finally, conclusions of the present study are summarized.

2 Parallel Tangent (Partan) Gradient 

The method of gradient descent is one of the most fundamental procedures for 
minimizing a differentiable function of several variables. In general, the gradient 
algorithm takes a point $p_i \in S \subset E^n$ and computes a new point
$p_{i+1} \in S \subset E^n$, where $S$ represents an arbitrary set and $E^n$ represents
the Euclidean space. The new point is defined by making

$$p_{i+1} = p_i + \eta g$$

where $\eta > 0$ for minimization or $\eta < 0$ for maximization. Further, $p_i$ is the
origin of the line, $g$ is the gradient vector, $\nabla f(p_i)$, determining the direction,
and $\eta$ is the step-size parameter to be estimated. The gradient algorithm usually
behaves poorly near an optimum point where small orthogonal steps are taken
(the zigzagging phenomenon). To illustrate the zigzagging phenomenon, let us consider
an objective function with concentric ellipsoidal contours as shown in Figure 1. If
the initial point for a gradient search happens to be precisely on one of the axes
of the system of ellipses, the gradient line will pass right through the optimum
(peak) and the search will be over in one descent (ascent). Otherwise, the search
will follow a zigzag course such as the one from $p_0$ to $p_2$ to $p_3$ to $p_4$, etc. (in
order to be consistent with the convention adopted in the next section, after $p_0$
the next point is denoted as $p_2$, instead of $p_1$).

Fig. 1. Zigzagging in elliptical contour.
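
The zigzagging behaviour can be reproduced numerically with a small sketch. It uses a hypothetical quadratic with elliptical contours, f(x, y) = 0.5 (x^2 + 10 y^2), and steepest descent with an exact line search; the descent step is written as p minus a positive step times the gradient, which is a convention chosen for this example. None of the names below come from the paper.

```python
import math

A = (1.0, 10.0)  # assumed curvatures of f(x, y) = 0.5 * (A[0]*x**2 + A[1]*y**2)


def grad(p):
    """Gradient of the quadratic at point p."""
    return (A[0] * p[0], A[1] * p[1])


def steepest_descent_step(p):
    """One gradient step with the exact minimizing step size for this quadratic."""
    g = grad(p)
    gg = g[0] ** 2 + g[1] ** 2
    if gg == 0.0:
        return p
    step = gg / (A[0] * g[0] ** 2 + A[1] * g[1] ** 2)
    return (p[0] - step * g[0], p[1] - step * g[1])


def descend(p, tol=1e-8, max_steps=10_000):
    """Repeat gradient steps until the gradient norm falls below tol."""
    steps = 0
    while math.hypot(*grad(p)) > tol and steps < max_steps:
        p = steepest_descent_step(p)
        steps += 1
    return p, steps


on_axis, n_axis = descend((5.0, 0.0))    # start on an axis of the ellipses
off_axis, n_off = descend((10.0, 1.0))   # start off the axes
```

Starting on an axis, the search ends after a single step; starting off the axes, successive gradients are orthogonal to each other and the search needs many short steps to approach the optimum.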

It can be seen that the crooked path is bounded by two straight lines which
intersect at the optimum. This suggests that the search from point $p_3$ be conducted,
not in the gradient direction toward $p_4$, but along the straight line from
$p_0$ through $p_3$. In this way, the peak, $p^*$, would be located after three steps: first
from $p_0$ to $p_2$ along the gradient at $p_0$, then from $p_2$ to $p_3$ along the gradient
at $p_2$, and finally from $p_3$ along the line through $p_0$ and $p_3$. This is the two
dimensional version of a method which accelerates along a ridge and usually is
called gradient parallel tangents (gradient Partan) [2, 8, 9].

Parallel tangent has many forms and gradient Partan is one form which com- 
bines many desirable properties of the simple gradient methods |S]. This tech- 
nique represents a distinct improvement over the method of steepest descent. 
Figure 2 shows a schematic diagram of general parallel tangent. Note that the
points have been numbered such that the odd-numbered ones (i.e., $p_3$, $p_5$, $p_7$,
$p_9$, etc.) are the results of a climb (gradient search), whereas the even-numbered
ones following $p_2$ (i.e., $p_4$, $p_6$, $p_8$, etc.) are obtained by acceleration. In other
words, the even-numbered point $p_{2k}$ is determined by acceleration from $p_{2k-4}$
through $p_{2k-1}$, $k = 2, 3, \ldots, N$, i.e.,

$$p_{2k} = \Omega(p_{2k-1}, p_{2k-4})$$

where $\Omega$ is the acceleration function. Acceleration is the process of taking the
minimum point on the line connecting $p_{2k-1}$ and $p_{2k-4}$.

Fig. 2. Parallel Tangent Gradient.

A procedure for minimizing a differentiable function of several variables is
given in Figure 3. This procedure starts with an arbitrary starting point and
searches for an optimum using a positive termination scalar, $\epsilon$. Starting at point
$p_0$, point $p_2$ is found by a standard gradient step. After that, the optimization
is continued for $n$ iterations and then restarted with a standard gradient descent.
Each step consists of a standard gradient descent search followed by an
acceleration.

PROCEDURE Partan(input [p0, ε], output [p*]);

/* η and μ are appropriate step sizes for climbing and acceleration,
   respectively. Γ represents the function that is used to compute a proper
   step size and n represents the number of independent variables in E^n.
*/

BEGIN
    p* = p0;
    REPEAT
        pk = p0 = p*;
        p* = pk − η∇f(pk);
        FOR i = 1 to n DO
        BEGIN
            η = Γ(η, ∇f(p*), n);
            g = p* − η∇f(p*);
            δ = g − pk;
            pk = p*;
            p* = g + μ·δ;
            μ = Γ(g, δ, n);
        END;
    UNTIL (‖p0 − p*‖ < ε);
    RETURN (p*);
END.



Fig. 3. Parallel tangent gradient optimization algorithm. 
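
A runnable counterpart of the procedure in Figure 3 is sketched below. It is a simplified illustration, not the authors' implementation: the step-size function is replaced by an exact line search, which is only available because the objective is an assumed diagonal quadratic, and all names are hypothetical.

```python
A = (1.0, 10.0)  # assumed diagonal quadratic: f(p) = 0.5 * sum(a * x**2)


def grad(p):
    """Gradient of the quadratic at point p."""
    return [a * x for a, x in zip(A, p)]


def line_min(p, d):
    """Exact minimizer of f along the line p + t*d (stands in for the step-size function)."""
    curvature = sum(a * di * di for a, di in zip(A, d))
    if curvature == 0.0:
        return list(p)
    t = -sum(g * di for g, di in zip(grad(p), d)) / curvature
    return [x + t * di for x, di in zip(p, d)]


def gradient_step(p):
    """Line search along the negative gradient."""
    return line_min(p, [-g for g in grad(p)])


def partan(p0, eps=1e-10, max_cycles=100):
    """Gradient Partan: each inner iteration is a gradient step followed by an acceleration."""
    best = list(p0)
    for _ in range(max_cycles):
        start = older = best
        best = gradient_step(older)                 # first standard gradient step
        for _ in range(len(p0)):
            climbed = gradient_step(best)           # gradient step
            delta = [c - o for c, o in zip(climbed, older)]
            older = best
            best = line_min(climbed, delta)         # acceleration through the older point
        moved = sum((s - b) ** 2 for s, b in zip(start, best)) ** 0.5
        if moved < eps:                             # the whole cycle made no progress
            break
    return best
```

On this two-dimensional quadratic, the first acceleration step already lands on the optimum, in line with the three-step argument given earlier for elliptical contours.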



To accelerate the convergence of steepest descent based learning algorithms 
(i.e., delta rule and generalized delta rule), we have proposed the use of par- 
allel tangent (Partan) gradient. The proposed technique improves the speed of 
training multilayer neural networks by a large factor, as demonstrated in [^. 

3 Partan Backpropagation 

In practice, the backpropagation (BP) training algorithm has proved to be a
suitable method for computing a weight vector that enables the network to
perform a certain input-output mapping. It trains a network iteratively. In
order to properly train the network, an objective function, which is the result
of the contributions of all training samples, is minimized over all samples
simultaneously. The BP algorithm minimizes the network's error by continuously
adjusting the network's connection weights. The weights are updated using the
gradient of the cost function, as follows:

$$w_{i+1} = w_i + \eta g,$$

where $g$ represents the gradient of the cost function and $\eta$ represents the
learning rate. The standard backpropagation algorithm (generalized delta rule)
has the reputation of being very slow. It suffers from the major drawbacks
associated with the steepest descent technique. As explained in Section 2, the
gradient descent method starts to zigzag near the optimum point. This is mainly
because a fixed learning rate used to determine the step size may not be
appropriate for all regions of the error surface.

The parallel tangent (Partan) method overcomes the difficulties of the
zigzagging phenomenon by deflecting the gradient steps. The Partan technique
combines many desirable properties of the simple gradient method. It uses an
accelerating step after each gradient step, and can be used as an alternative to
the momentum term to accelerate the convergence. In Partan, connection weights
are updated as follows:

$$w_{i+1} = w_i + \eta g + \mu s,$$

where $s$ represents a direction based on two previous gradient steps and $\mu$
is the accelerating rate. The general framework of the new training technique is
defined as follows. This procedure can be restarted every $n$ steps; however,
global convergence is not tied to this restart.

Begin
  Do one gradient step.
  While (error >= Desired Threshold)
    Do one gradient step.
    Do one accelerating step.
  End
End
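
The framework above can be illustrated on a toy quadratic error surface. The
following Python sketch is only an illustration under stated assumptions: the
direction $g$ is taken to be the negative gradient, the direction $s$ is taken
from the new gradient point back through the point before the current one (as in
Figure 3), and the matrix `H`, the rates `eta` and `mu` and the function names
are hypothetical.

```python
import numpy as np

def partan_fixed_rates(grad, w0, eta, mu, steps):
    """Fixed-rate Partan: a gradient step followed by an accelerating step."""
    w_prev = np.asarray(w0, dtype=float)
    w = w_prev - eta * grad(w_prev)            # initial gradient step
    for _ in range(steps):
        g_point = w - eta * grad(w)            # gradient step from w
        s = g_point - w_prev                   # direction through the older point
        w_prev = w
        w = g_point + mu * s                   # accelerating step
    return w

def gradient_descent(grad, w0, eta, steps):
    """Plain fixed-rate gradient descent, for comparison."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy error surface E(w) = 0.5 * w'Hw with its minimum at the origin.
H = np.diag([1.0, 10.0])
grad = lambda w: H @ w
w0 = np.array([4.0, 1.0])

# Both runs use 51 gradient evaluations.
err_partan = np.linalg.norm(partan_fixed_rates(grad, w0, eta=0.1, mu=0.5, steps=50))
err_gd = np.linalg.norm(gradient_descent(grad, w0, eta=0.1, steps=51))
print(err_partan, err_gd)
```

On this particular surface the accelerated iterate ends much closer to the
minimum than plain gradient descent for the same number of gradient
evaluations; this is a property of the toy example and not a general guarantee.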

A proposed detailed algorithm for the Partan backpropagation is given in
Figure 4. This procedure starts with an arbitrary starting point and searches
for an optimum using a positive termination scalar, $\epsilon$. Starting at
point $w_0$, point $w_2$ is found by a standard gradient step. Following the
initial step, the optimization is continued for $n$ iterations and may restart
from another random initial point until the optimum weight vector $w^*$ is
found. After $n$ iterations, one has the choice of either continuing the cycle
of backpropagation search and acceleration or starting over again. In Figure 4
we have presented the latter choice.

PROCEDURE Partan_backprop(input [I, D, w0, ε], output [w*]);

/* I and D are the input and the desired output vectors, respectively.
   w0 is the starting point, w* is the optimum weight vector, and ε is a
   termination scalar which is chosen to be greater than zero. η and μ
   are appropriate step sizes for climbing and acceleration, respectively.
   Γ represents the function that is used to compute a proper step size and
   n represents the number of independent variables. M is the number of
   epochs. backprop is the standard backpropagation procedure that returns
   the gradient of the criterion function at a given point and the amount of
   existing error.
*/

BEGIN
  j = 1;
  w* = w0;
  REPEAT
    wk = w0 = w*;
    Call backprop(input [I, wk, ε, D], output [∇wk, error]);
    w* = wk - η∇wk;
    FOR i = 1 to n DO
    BEGIN
      Call backprop(input [I, w*, ε, D], output [∇w*, error]);
      IF (error < ε) RETURN (w*), EXIT;
      η = Γ(η, ∇w*, n);
      g = w* - η∇w*;
      δ = g - wk;
      wk = w*;
      w* = g + μδ;
      μ = Γ(μ, δ, n);
    END;
  UNTIL (error < ε OR j > M);
  RETURN (w*);
END.

Fig. 4. Parallel tangent backpropagation learning algorithm.

The learning rate, $\eta$, plays an important role in the convergence of a
network. Choosing an appropriate value for the learning rate can speed up the
training process. A large learning rate is efficient in the flat regions of the
error surface but usually causes oscillation near the optimum point. On the
other hand, small learning rates tend to slow down the convergence of a network.
We have shown that adapting the learning rate during the training process speeds
up the convergence [6]. The learning method proposed there is called dynamic
Partan. In dynamic Partan, the learning rate is adapted for each gradient step.



The adaptation is done with respect to the properties of the error surfaces. The
framework of the dynamic Partan training algorithm is as follows:

Begin
  Do one gradient step.
  While (error >= Desired Threshold)
    Adapt learning rate.
    Do one gradient step.
    Do one accelerating step.
  End
End

In this paper, we propose the use of a dynamic accelerating rate, $\mu$, in the
Partan backpropagation. In the proposed method, the accelerating rate is adapted
prior to taking the acceleration step. The adaptation technique is similar to
that of the learning rate adaptation. The extended dynamic Partan algorithm is
as follows:

Begin
  Do one gradient step.
  While (error >= Desired Threshold)
    Adapt learning rate.
    Do one gradient step.
    Adapt accelerating rate.
    Do one accelerating step.
  End
End



3.1 Rates Adaptation 

In the standard parallel tangent backpropagation algorithm, the learning rate as
well as the accelerating rate is fixed during the process of training the
network. Moreover, all the connection weights use the same learning and
accelerating rates. In dynamic Partan, the learning rates are adapted
continuously over time. The adaptation is done with respect to the shape of the
error surfaces. The sign of the gradient is used to adapt the learning rates. If
consecutive changes (local gradients) of a connection weight possess the same
sign, the learning rate for that connection is increased [T]. This increase
helps to take a longer step in the next iteration. If consecutive gradients
possess opposite signs, the previous learning rate was too large and a jump over
the local minimum has occurred. Thus, the next step should be carried out with a
smaller learning rate. This is done by removing the effect of the previous step
(i.e., backtracking one step) and decreasing the learning rate appropriately.

In dynamic Partan, the accelerating rate is fixed during the training of the
network, whereas in the extended dynamic Partan proposed in this paper, each
connection has its own accelerating rate. The accelerating rates are also
adapted dynamically as the training proceeds. The adaptation of the accelerating
rates is done similarly to that of the learning rates: the accelerating rate is
increased or decreased whenever the corresponding learning rate is increased or
decreased.

Four dynamic Partan schemes, called Partan1, Partan2, Partan3, and Partan4, are
presented in this paper. Dynamic Partan1 and 2 use variable learning rates and
fixed accelerating rates during the training process. In dynamic Partan1, the
learning rates are adapted as follows:

$$\eta_{i+1} = \eta_i \, \nu^{+},$$

$$\eta_{i+1} = \eta_i \, \nu^{-},$$

where $\nu^{+}$ and $\nu^{-}$ are the adaptation rates used to increase and
decrease the learning rates, respectively. In dynamic Partan2, by contrast, the
adaptation rates ($\nu^{+}$, $\nu^{-}$) are added to or subtracted from the
learning rates.

Dynamic Partan3 and 4 are extended versions of dynamic Partan1 and 2. In these
schemes, dynamic learning rates as well as dynamic accelerating rates are used
during the training process. The accelerating rates are adapted similarly to the
adaptation of the learning rates in dynamic Partan1 and 2. In dynamic Partan3,
the accelerating rates are adjusted as follows:

$$\mu_{i+1} = \mu_i \, \tau^{+},$$

$$\mu_{i+1} = \mu_i \, \tau^{-},$$

where $\tau^{+}$ and $\tau^{-}$ are the adaptation rates used to increase and
decrease the accelerating rates, respectively. In dynamic Partan4, the
adaptation rates are added to or subtracted from the accelerating rates.
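
The sign-based adaptation described in this section can be sketched in Python as
follows. This is a hedged illustration rather than the authors' code: the
function name `adapt_rates`, the handling of zero gradients (rate left
unchanged) and the numerical values are assumptions. The same routine can be
applied to the learning rates (with $\nu^{+}$, $\nu^{-}$) and to the
accelerating rates (with $\tau^{+}$, $\tau^{-}$).

```python
import numpy as np

def adapt_rates(rates, grad_prev, grad_curr, up, down, additive=False):
    """Adapt one rate per connection from the signs of consecutive gradients.

    Multiplicative form (Partan1/Partan3 style): rate*up or rate*down.
    Additive form (Partan2/Partan4 style): rate+up or rate-down.
    Returns the new rates and a mask of connections whose gradient sign
    flipped (candidates for backtracking one step).
    """
    product = grad_prev * grad_curr
    same, flipped = product > 0, product < 0
    new = np.array(rates, dtype=float)       # copy, one rate per connection
    if additive:
        new[same] += up
        new[flipped] -= down
    else:
        new[same] *= up
        new[flipped] *= down
    return new, flipped

# Hypothetical rates and gradients for three connections.
eta = np.array([0.6, 0.6, 0.6])
g_prev = np.array([1.0, -1.0, 0.5])
g_curr = np.array([2.0, 1.0, 0.0])

eta_mult, flipped = adapt_rates(eta, g_prev, g_curr, up=1.1, down=0.7)
eta_add, _ = adapt_rates(eta, g_prev, g_curr, up=0.0, down=0.05, additive=True)
print(eta_mult, flipped, eta_add)
```

Returning the mask of flipped connections keeps the backtracking decision
separate from the rate update, so a training loop can undo the previous step
only for those connections.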



4 Simulation 

In order to evaluate the performance of the extended dynamic Partan learning 
algorithm, simulation studies are carried out on different learning problems. The 
learning problems are chosen so that they possess different error surfaces and 
collectively represent an environment that is suitable to determine the effect of 
the proposed learning algorithms. For all the methods presented in this paper, 
the backpropagation procedure is used to calculate the partial derivatives of the 
error with respect to each weight. 

The network architectures are predetermined, specifying the number of hidden
units, the step sizes $\eta$ and $\mu$, the number of patterns in the training
set, and the convergence criterion, $\epsilon$, which was set so that the
average error per pattern in the training set is below some threshold. For the
standard backpropagation networks, we have selected the architectures and
learning parameters (i.e., learning and momentum rates) that resulted in good
performance. The same parameters and architectures are used for the different
Partan schemes. Architectures tuned for the Partan algorithms may show an even
faster rate of convergence. The simulation studies are carried out using a large
number of learning problems. The results for the Sin function, the Sonar
classification problem and a character recognition problem are summarized in
Tables 1-4.

At the start of each simulation, the weights are initialized to random values
between $-r$ and $+r$. Since the backpropagation algorithm is sensitive to
different starting points, we carried out our simulations with various runs
starting from different random initializations of the network weights. For each
algorithm, 25 simulations were attempted.

Four schemes of dynamic Partan, namely Partan1, Partan2, Partan3, and Partan4,
are implemented. Dynamic Partan1 and 2 use variable learning rates and fixed
accelerating rates, whereas dynamic Partan3 and 4 use dynamic learning rates as
well as dynamic accelerating rates during the training process.

The results of the simulations for the above problems are summarized in
Tables 1-4. In these tables, $\alpha$ represents the momentum rate used in the
standard backpropagation training algorithm. The results shown in these tables
indicate that dynamic Partan3 and dynamic Partan4 exhibit a faster rate of
convergence compared to standard backpropagation and standard Partan.



4.1 Sin Problem 

In this problem, neural networks are used for function approximation. The
network architecture used for this problem consists of one input unit, one
output unit with a linear activation function, and two hidden layers with 8 and
3 sigmoidal units, respectively. The training is considered complete when the
cumulative error is below 0.01. The training and test sets consist of 60 and 360
points in the range $[-\pi, +\pi]$, respectively.

The results of our simulation studies are summarized in Table 1. It is seen
that, on average, standard backpropagation with $\eta = 0.6$ and $\alpha = 0.5$
converges after 218 epochs. Dynamic Partan1, with the same learning rate as that
of the standard backpropagation and $\mu = 0.8$, converges after 125 epochs.
Dynamic Partan4 converges after 85 epochs. The results show that, on average,
dynamic Partan4 converges about twice as fast as standard Partan (167 epochs)
and 2.56 times faster than the standard BP algorithm.
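
The experimental setup for this problem can be sketched in Python as follows.
The sketch only builds the data and an untrained 1-8-3-1 network; it is not the
authors' code. Evenly spaced sample points, the presence of bias terms, the sum
of squared errors as the cumulative error and the value `r = 0.5` used in the
call are assumptions, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(sizes, r):
    """Weights and biases drawn uniformly from [-r, r] for consecutive layers."""
    return [(rng.uniform(-r, r, (m, n)), rng.uniform(-r, r, n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Sigmoidal hidden layers followed by a linear output unit."""
    a = x
    for W, c in params[:-1]:
        a = 1.0 / (1.0 + np.exp(-(a @ W + c)))
    W, c = params[-1]
    return a @ W + c

def cumulative_error(params, x, y):
    """Sum of squared errors over the set (one possible error measure)."""
    return float(np.sum((forward(params, x) - y) ** 2))

# 60 training points and 360 test points in [-pi, +pi].
x_train = np.linspace(-np.pi, np.pi, 60).reshape(-1, 1)
x_test = np.linspace(-np.pi, np.pi, 360).reshape(-1, 1)
y_train, y_test = np.sin(x_train), np.sin(x_test)

params = init_weights([1, 8, 3, 1], r=0.5)
print(cumulative_error(params, x_train, y_train))
```

Repeating the run with different seeds for `rng` corresponds to the repeated
simulations from different random initializations described above.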



Table 1. The training results for Sin function.

Learning Parameters   r      η      α     μ      ν+   ν-     τ+     τ-     Avg. Epochs
Backpropagation       0.5    0.6    0.5   -      -    -      -      -      218
Standard Partan       0.5    0.19   -     0.76   -    -      -      -      167
Dynamic Partan 1      0.5    0.6    -     0.8    1    0.7    -      -      125
Dynamic Partan 2      0.5    0.95   -     0.8    0    0.05   -      -      126
Dynamic Partan 3      0.5    0.6    -     0.5    1    0.7    0.01   0.02   88
Dynamic Partan 4      0.95   0.6    -     0.6    1    0.7    0.95   0.83   85



4.2 Sonar Problem 

This problem is the classification of sonar signals using neural networks. The
task is to train a network to discriminate between sonar signals bounced off a
metal cylinder and those bounced off a roughly cylindrical rock. The data set
consists of 104 training examples (49 Mine patterns and 55 Rock patterns) and
104 test examples (62 Mine patterns and 42 Rock patterns). Each pattern consists
of 60 numbers in the range 0.0 to 1.0 as input and 2 binary values as output.
The network architecture used for this problem consists of 60 input units, 21
hidden units and 2 output units.

The simulation results for this problem are given in Table 2. The training is
considered complete when the error for one epoch is less than 0.1. It is seen
that dynamic Partan3 converges about 2.5 times faster than standard BP and about
1.6 times faster than standard Partan.

Table 2. The training results for Sonar, Mines vs. Rocks Problem.

Learning Parameters   r      η      α     μ      ν+   ν-      τ+   τ-     Avg. Epochs
Standard BP           0.3    0.4    0.9   -      -    -       -    -      390
Standard Partan       0.3    0.2    -     0.67   -    -       -    -      257
Dynamic Partan 1      0.35   0.95   -     0.74   1    0.29    -    -      199
Dynamic Partan 2      0.35   0.98   -     0.7    0    0.205   -    -      203
Dynamic Partan 3      0.3    0.32   -     1.5    1    0.5     0    0.01   160
Dynamic Partan 4      0.3    0.98   -     1.1    1    0.29    1    0.98   180



After training the network, its memorization and generalization abilities are
examined with the training and test patterns, respectively. The results of
testing the network with $\epsilon = 0.1$ and $\epsilon = 0.07$ are given in
Table 3. It is seen that the dynamic Partan schemes show stronger memorization
and generalization capabilities. When the error threshold is set to 0.07,
dynamic Partan3 is able to memorize all the training examples and correctly
classify 91 percent of the unseen patterns from the test set. For the same
threshold, standard BP and standard Partan both show 96% memorization
capability, and 81% and 86% generalization ability, respectively.



Table 3. The memorization and generalization results for Sonar problem.

                   Error Threshold = 0.1             Error Threshold = 0.07
Algorithm          Epochs  Memorize.  Generalize.    Epochs  Memorize.  Generalize.
Standard BP        338     94%        80%            390     96%        81%
Standard Partan    179     95%        85%            257     96%        86%
Dynamic Partan1    169     96%        85%            199     96%        82%
Dynamic Partan2    169     96%        86%            203     96%        86%
Dynamic Partan3    136     96%        90%            160     100%       91%
Dynamic Partan4    149     97%        89%            180     98%        90%



4.3 Character Recognition Problem 

The network architecture used for solving this problem consists of 64 input
units, twenty-five hidden units, and one output unit. The task is to train a
multilayer neural network to recognize English capital letters. Each letter is
represented as an 8 x 8 matrix of 0s and 1s. There are 24 different patterns for
each letter. Six patterns represent positional movements of the letter inside
the matrix (i.e., moving the letter up, down, left or right inside its 8 x 8
matrix). Each of these patterns is represented at four different angles (0, 90,
180 and 270 degrees). In other words, besides the original representation of a
letter, there are 3 more patterns for each representation that show the state of
that letter after being rotated by 90, 180 and 270 degrees.
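
The construction of the 24 patterns per letter (six positions, each at four
angles) can be sketched in Python as below. The text does not list the six
positions or the letter bitmaps, so the shift list and the 8 x 8 pattern used
here are purely hypothetical, as are the function names.

```python
import numpy as np

def shift(img, dr, dc):
    """Move the pattern by dr rows and dc columns, filling vacated cells with 0."""
    out = np.zeros_like(img)
    h, w = img.shape
    src = img[max(0, -dr):h - max(0, dr), max(0, -dc):w - max(0, dc)]
    out[max(0, dr):max(0, dr) + src.shape[0],
        max(0, dc):max(0, dc) + src.shape[1]] = src
    return out

def variants(letter, shifts):
    """All shifted copies of a letter, each at 0, 90, 180 and 270 degrees."""
    return [np.rot90(shift(letter, dr, dc), k)
            for dr, dc in shifts for k in range(4)]

# Hypothetical 8 x 8 pattern and shift list, for illustration only.
letter = np.zeros((8, 8), dtype=int)
letter[2:6, 3] = 1
letter[5, 3:6] = 1
shifts = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1), (1, 1)]

patterns = variants(letter, shifts)
inputs = np.array([p.ravel() for p in patterns])   # 64 inputs per pattern
print(inputs.shape)
```

Flattening each 8 x 8 matrix gives the 64 values fed to the input units.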

The training and test sets each consist of 628 patterns. The training is
considered complete when the error of one epoch is less than 0.01. The training
results for this problem are shown in Table 4. It is seen that dynamic Partan4
converges about 7 times faster than standard BP and about 3.6 times faster than
standard Partan.

Table 4. The training results for character recognition problem.

Learning Parameters   r     η       α     μ      ν+      ν-     τ+      τ-       Avg. Epochs
Standard BP           0.5   0.01    0.5   -      -       -      -       -        3703
Standard Partan       0.5   0.007   -     0.7    -       -      -       -        1936
Dynamic Partan 2      0.5   0.001   -     0.9    0.003   0.01   -       -        663
Dynamic Partan 3      0.5   0.002   -     0.84   0.003   0.01   0.015   0.0025   543
Dynamic Partan 4      0.5   0.002   -     0.7    0.003   0.01   1.22    0.95     539

5 Conclusion 

Parallel tangent (Partan) gradient is a deflecting method that combines many
desirable characteristics of the simple gradient method and has certain
ridge-following properties which make it attractive. Partan as well as dynamic
Partan have been used to accelerate the convergence of the backpropagation
learning algorithm [3, 4, 6]. In this paper, we have proposed two extensions to
the dynamic Partan training algorithm.

The main features of the parallel tangent technique are its simplicity,
ridge-following, and ease of implementation. The most desirable property of
Partan backpropagation, however, is its strong global convergence
characteristics. Each step of the process is at least as good as steepest
descent; the additional move (acceleration) to $p_{i+1}$ provides a further
decrease of the objective function.

In dynamic parallel tangent, local information is used for the adaptation of the
learning as well as the accelerating rates. The local adaptation of the rates is
'similar' to the biological neural learning adaptation process and is more
suitable for parallel implementations. We have demonstrated through simulation
that the dynamic adaptation of rates is an effective approach to speed up the
training of multilayer neural networks. The network's energy function behaves
differently in various dimensions during the training process. This behavior is
accommodated by using and dynamically adapting different learning and
accelerating rates for the connection weights.

In all the problems we have studied so far, the convergence of dynamic Partan
was faster than that of standard BP as well as standard Partan. Table 5 depicts
the speedup achieved for the three learning problems studied in this paper. The
results show that, on average, the rates of convergence of standard Partan and
dynamic Partan1-4 are approximately 1.56, 2.70, 3.185, 4.36 and 4.3 times faster
than that of the standard BP algorithm, respectively.

Table 5. Average speedup of standard and dynamic Partan versus standard
backpropagation.

Learning problem     s-Partan   d-Partan1   d-Partan2   d-Partan3   d-Partan4
Sin function         1.30       1.74        1.73        2.47        2.56
Sonar problem        1.51       1.95        1.92        2.43        2.16
Char. recognition    1.91       -           5.58        6.81        6.78
Average speedup      1.56       2.70        3.18        4.36        4.3
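
The per-problem speedups are ratios of average epoch counts, i.e. the epochs
needed by standard BP divided by the epochs needed by the scheme in question.
The short Python sketch below recomputes them from the averages reported in
Tables 1, 2 and 4; the dictionary layout and the function name are hypothetical.
Because the inputs are rounded averages, the recomputed ratios need not agree
with Table 5 in every digit.

```python
# Average epochs copied from Tables 1, 2 and 4 (None: not reported).
epochs = {
    "Sin function": {"BP": 218, "s-Partan": 167, "d-Partan1": 125,
                     "d-Partan2": 126, "d-Partan3": 88, "d-Partan4": 85},
    "Sonar problem": {"BP": 390, "s-Partan": 257, "d-Partan1": 199,
                      "d-Partan2": 203, "d-Partan3": 160, "d-Partan4": 180},
    "Char. recognition": {"BP": 3703, "s-Partan": 1936, "d-Partan1": None,
                          "d-Partan2": 663, "d-Partan3": 543, "d-Partan4": 539},
}

def speedups(row):
    """Speedup of each scheme relative to standard BP, rounded to 2 decimals."""
    return {name: round(row["BP"] / e, 2)
            for name, e in row.items() if name != "BP" and e is not None}

for problem, row in epochs.items():
    print(problem, speedups(row))
```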
Abstract. The PAC learning theory creates a framework to assess the
learning properties of models such as the required size of the training
samples and the similarity between the training and testing performances.
These properties, along with stochastic stability, form the main
characteristics of a typical dynamic ARX modeling using neural networks.
In this paper, an extension of PAC learning theory is defined
which includes ARX modeling tasks, and then, based on the new learning
theory, the learning properties of a family of neural ARX models are
evaluated. The issue of stochastic stability of such networks is also
addressed. Finally, using the obtained results, a cost function is proposed
that considers the learning properties as well as the stochastic stability
of a sigmoid neural network and creates a balance between the testing
and training performances.



Keywords: Neural Networks, Evolutionary Programming, Learning Theory, 
Nonlinear ARX Models 

1 Introduction 

In a dynamic modeling task in the presence of additive noise, the output of a
system is expressed as a function of the history of the input as well as the
history of the output. In the case of a nonlinear ARX (also known as NARX)
model, assuming that $u_{t-q+1}, u_{t-q+2}, \ldots, u_{t-d}$ describe the
history of the input variable and $y_{t-k}, y_{t-k+1}, \ldots, y_{t-1}$ that of
the output, then:

$$y_t = f(y_{t-k}, y_{t-k+1}, \ldots, y_{t-1}, u_{t-q+1}, u_{t-q+2}, \ldots, u_{t-d}) + \zeta_t \qquad (1)$$

where $d$, $q - d - 1$, $k$ and $\zeta_t$ represent the degree of the input, the
delay from the input to the output, the degree of the output and the additive
noise on the system, respectively. Although one can consider multi-dimensional
models, here the focus is given to the single-input/single-output (SISO) case.
It is also assumed that $u_t$ and $\zeta_t$ are uncorrelated sequences of
independently and identically distributed (i.i.d.) random variables. The Markov
process formed as (1) includes a wide range of dynamic models used in
engineering applications, such as dynamic neural networks. As a result, all
properties of NARX models can be further specified in the case of a particular
dynamic neural model. One of the most important properties of a NARX model to be
investigated is the stochastic stability of the model. This issue has been
addressed in the literature assuming different definitions of stochastic
stability, resulting in different sufficient stability conditions for NARX
models. The concept of Lagrange stability defines a notion of stability based on
the relative compactness of the sets generated by a stochastic difference
equation. Kushner's work on stochastic stability has provided a more
comprehensive mathematical framework for the assessment of stochastic stability
of continuous and discrete systems. The most comprehensive results in this field
come from the school that explores the relation between stochastic Lyapounov
stability (as in Kushner's work) and the concept of geometric ergodicity. The
results of this line of research not only provide simple practical notions of
stochastic stability, but they also create a foundation for the assessment of
other statistical properties such as the learning properties of dynamic models
|B]. In all learning paradigms presented for dynamic models [7], |H], the
assumption that the models (processes) are geometrically ergodic is treated as a
fundamental requirement, i.e. the learning properties of dynamic models cannot
be evaluated unless geometric ergodicity of the models is verified. This further
calls for the evaluation of geometric ergodicity for important families of
nonlinear dynamic models. Geometric ergodicity is a property that has been
investigated for a variety of stochastic models and seems to be an appropriate
measure of stochastic stability as well as learning. Here, a general result of
[4] is applied to the special case of neural modeling with sigmoid neural
networks, and specific sufficient conditions are presented under which the model
is geometrically ergodic.
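
To make the NARX structure concrete, the recursion can be simulated in a few
lines of Python. This is only an illustrative sketch: the atan network used for
$f$, its weights `a` and `B`, the uniform distributions of the input and noise,
and the orders `k`, `q`, `d` are all hypothetical choices, not values taken from
the paper.

```python
import numpy as np

def simulate_narx(f, u, noise, k, q, d):
    """Iterate y_t = f(y_{t-k..t-1}, u_{t-q+1..t-d}) + noise_t."""
    T = len(u)
    y = np.zeros(T)
    start = max(k, q - 1)                 # first t with a complete history
    for t in range(start, T):
        x = np.concatenate((y[t - k:t], u[t - q + 1:t - d + 1]))
        y[t] = f(x) + noise[t]
    return y

# Hypothetical atan network with l = 2 neurons and p = q - d + k = 4 inputs.
a = np.array([0.5, -0.3])
B = np.array([[0.4, 0.2, 0.3, -0.1],
              [0.1, -0.3, 0.2, 0.5]])
f = lambda x: (2 / np.pi) * (a @ np.arctan(B @ x))

rng = np.random.default_rng(1)
T, k, q, d = 500, 2, 3, 1
u = rng.uniform(-1, 1, T)            # i.i.d. input sequence
noise = rng.uniform(-0.1, 0.1, T)    # i.i.d. additive noise
y = simulate_narx(f, u, noise, k, q, d)
print(y[:5], np.abs(y).max())
```

Since the scaled atan takes values in (-1, 1), this particular $f$ is bounded by
the sum of the $|a_i|$, so the simulated output stays within 0.8 + 0.1 in
absolute value regardless of the input.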

Having obtained a set of sufficient conditions for geometric ergodicity of 
SNN’s, the results are used to assess the learning properties of neural ARX 
models. In order to do so, first the learning theory is extended to the learning with 
strong mixing data. Then, specific upper bounds on the sample complexity of 
such models are given. Finally, the learning properties of neural ARX are applied 
to define complexity measures (along with their corresponding cost functions) 
that can be used in practical applications. 

This paper is organized as follows: Section 5.2 describes the basic learning
and stability definitions applied to nonlinear ARX modeling. In Section 5.3,
some of the results on learning and stability properties of neural ARX models
are reviewed. In the same section, the resulting learning theory is applied to
ARX SNN's to bound the sample complexity of such learning tasks. Section 5.6
uses the results of the previous sections to describe a learning-based algorithm
that searches for neural models with minimum complexity and is followed by the
conclusions.

2 Basic Definitions of Stochastic Stability and Learning

In this section, some of the basic concepts of geometric ergodicity as well as
the existing results on the stochastic stability of NARX models are reviewed.
Let $X_t$ be a Markov chain with the state space $(\mathbb{R}^p, \mathcal{B})$,
$\mathcal{B}$ being the collection of Borel sets. The $t$-step-ahead transition
probability of $X_t$ is denoted by $P^t(x, A)$, i.e.:

$$P^t(x, A) = P(X_t \in A \mid X_0 = x), \quad x \in \mathbb{R}^p, \; A \in \mathcal{B}. \qquad (2)$$

Now the concept of geometric ergodicity is defined:



Definition 2.1 $X_t$ is geometrically ergodic if there exists a probability
measure $\pi$ on $(\mathbb{R}^p, \mathcal{B})$, a positive constant $\rho < 1$,
and a $\pi$-integrable non-negative measurable function $w$ such that:

$$\| P^t(x, \cdot) - \pi(\cdot) \|_{var} \le \rho^t w(x), \quad x \in \mathbb{R}^p \qquad (3)$$

where $\|\cdot\|_{var}$ denotes the total variation norm.
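
A small numerical example may help to read this definition. The Python sketch
below uses a two-state Markov chain instead of a chain on $\mathbb{R}^p$, so it
is only an analogy: the transition matrix `P` is hypothetical, and the norm is
taken as the sum of absolute differences over the states. For this chain the
distance to the stationary distribution decays exactly like $\rho^t w(x)$ with
$\rho = 0.7$.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # hypothetical transition matrix
pi = np.array([2 / 3, 1 / 3])       # its stationary distribution (pi @ P == pi)

def tv_norm(t, x):
    """Sum of |P^t(x, .) - pi(.)| over the states."""
    row = np.linalg.matrix_power(P, t)[x]
    return float(np.abs(row - pi).sum())

rho = 0.7                           # second eigenvalue of P
w = np.array([2 / 3, 4 / 3])        # w(x) for the starting states x = 0, 1
for t in (1, 5, 10):
    print(t, tv_norm(t, 0), rho ** t * w[0])
```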



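Next, the notion of "$\alpha$-mixing" (also known as strong mixing) is defined,
which describes a type of stationary random process with exponentially weakening
dependency.

Definition 2.2 Let $\mathcal{U}$ and $\mathcal{V}$ be two sub-$\sigma$-algebras
of some $\sigma$-algebra $\mathcal{A}$. Then a measure of dependency is defined
as:

$$\alpha(\mathcal{U}, \mathcal{V}) = \sup\{ |\Pr(U)\Pr(V) - \Pr(U \cap V)| : U \in \mathcal{U}, V \in \mathcal{V} \} \qquad (4)$$

Now, if $\mathcal{A} = \mathcal{Y}_{-\infty}^{\infty}$,
$\mathcal{U} = \mathcal{Y}_{-\infty}^{t_0}$ and
$\mathcal{V} = \mathcal{Y}_{t_0+t}^{\infty}$ (the $\sigma$-algebras generated by
the process $\{y_t\}$ over the indicated time ranges), and assuming that
$\{y_t\}$ is a stationary process, then
$\alpha(\mathcal{Y}_{-\infty}^{t_0}, \mathcal{Y}_{t_0+t}^{\infty})$ does not
depend on $t_0$ and can be denoted as $\alpha_Y(t)$. If $\alpha_Y(t)$ approaches
zero as $t \to \infty$, the process $Y$ is called $\alpha$-mixing.

Moreover, suppose $\alpha_Y(t)$ approaches zero geometrically fast in $t$, i.e.,
there exist $k_1, k_2, k_3 > 0$ such that:

$$\alpha_Y(t) \le k_1 e^{-k_2 t^{k_3}} \qquad (5)$$

Then, the process is called geometrically $\alpha$-mixing.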

Next, the notion of PAC learning with geometrically $\alpha$-mixing data is
reviewed (see 0), which is the natural extension of conventional PAC learning to
geometrically $\alpha$-mixing cases. One can easily omit "geometrically" and
define the learning scheme for a general family of learning tasks where the data
is $\alpha$-mixing but not necessarily "geometrically $\alpha$-mixing". However,
in order to specialize the results towards the geometric case, here the focus is
given to the geometrically mixing cases only.

Definition 2.3 Suppose that $z_n$ is a set of input-output data where
$(x_1, \ldots, x_n)$ is a sequence of geometrically $\alpha$-mixing r.v.s,
marginally distributed according to the probability measure
$P \in \mathcal{P}$. Then, a function set $\mathcal{F}$ is said to be "PAC
learnable with geometrically $\alpha$-mixing data" iff an algorithm $A$ can be
found based on which, for any $\epsilon$ and $\delta$, there exists $n$ such
that:

$$\sup_{f \in \mathcal{F}} \Pr\{ d_P(f, h) < \epsilon \} > 1 - \delta \qquad (6)$$

where $h$ is the output of the algorithm. Throughout this paper, it is assumed
that:

$$d_P(f, h) = \int_X |f(x) - h(x)| \, dP(x)$$

However, following a similar approach, the results can be extended to other
distance measures.

Another important concept in learning theory is an $\epsilon$-cover of a
function set.

Definition 2.4 An $\epsilon$-cover of a function set $\mathcal{F}$ is defined as
a set of functions $\{g_i\}_{i=1}^{q}$ in $\mathcal{F}$ such that for any
function $f \in \mathcal{F}$, there is a function $g_j$ where
$d_P(f, g_j) < \epsilon$.

It should be noted that an $\epsilon$-cover for a function set $\mathcal{F}$ may
or may not exist. If such a cover set exists, the cardinality (size) of the set
depends both on the value of $\epsilon$ and on the function set $\mathcal{F}$.
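
For a finite family of functions, a cover in this sense can be built greedily.
The Python sketch below is an illustration only: it replaces the integral in
$d_P$ by the mean absolute difference on a grid of points (an assumption), and
the family $f_c(x) = \mathrm{atan}(c x)$, the grid and the function names are
hypothetical.

```python
import numpy as np

def d_emp(f_vals, g_vals):
    """Empirical stand-in for d_P: mean absolute difference on the sample."""
    return float(np.mean(np.abs(f_vals - g_vals)))

def greedy_cover(values, eps):
    """Indices of a subset such that every row of `values` is within eps of one of them."""
    cover = []
    for i, f_vals in enumerate(values):
        # keep f_i only if no function already in the cover is eps-close to it
        if all(d_emp(f_vals, values[j]) >= eps for j in cover):
            cover.append(i)
    return cover

# Hypothetical finite family f_c(x) = atan(c * x), c in [0, 2].
x = np.linspace(-1.0, 1.0, 201)
cs = np.linspace(0.0, 2.0, 41)
values = np.array([np.arctan(c * x) for c in cs])

cover = greedy_cover(values, eps=0.1)
print(len(values), len(cover))
```

Every function is either placed in the cover or lies within `eps` of a function
placed earlier, so the returned subset satisfies the covering property on the
grid; smaller values of `eps` lead to larger covers.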

A specific type of learning algorithm known as "the empirical risk minimization
algorithm", which is used in this paper, is now defined:

Definition 2.5 Let $\epsilon > 0$ be specified, and let $\{g_i\}_{i=1}^{q}$ be
an $\epsilon/2$-cover of $\mathcal{F}$ with respect to $d_P$. Then the empirical
risk minimization algorithm is as follows: Consider a set of samples as defined
above. Define the cost functions:

$$\hat{J}_i = \frac{1}{n} \sum_{j=1}^{n} |f(x_j) - g_i(x_j)|, \quad i = 1, \ldots, q$$

Now, the output of the algorithm is a function $h = g_l$ such that:

$$\hat{J}_l = \min_{1 \le i \le q} \hat{J}_i$$
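
The selection step of empirical risk minimization over a finite cover can be
sketched in Python as follows. This is a minimal illustration under stated
assumptions: the cost of each cover element is taken to be its mean absolute
deviation from the observed target values, the element with the smallest cost is
returned, and the sample points, the cover and the target function are
hypothetical.

```python
import numpy as np

def erm(f_samples, cover_samples):
    """Return (index, cost) of the cover element with the smallest empirical cost."""
    costs = np.mean(np.abs(cover_samples - f_samples), axis=1)
    best = int(np.argmin(costs))
    return best, float(costs[best])

# Hypothetical data: n = 50 sample points and a cover of q = 5 functions.
x = np.linspace(-1.0, 1.0, 50)
cover_samples = np.array([np.arctan(c * x) for c in (0.0, 0.5, 1.0, 1.5, 2.0)])
f_samples = np.arctan(1.1 * x)           # observed target values f(x_j)

best, cost = erm(f_samples, cover_samples)
print(best, cost)
```

Here the target lies closest to the third cover element (index 2, with slope
1.0), which is the one selected.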
3 Stability and Learning of Neural ARX Models

Definition 2.1 shows that geometric ergodicity is an indication of stability.
According to (3), in a geometrically ergodic process, the transition probability
approaches a (possibly unknown) well-behaved probability measure $\pi$
geometrically fast. The following theorem by Mokkadem m is known to be the most
general result on geometric ergodicity of Markov processes.

Theorem 3.1 (Mokkadem JiTW ) Suppose that for a Markov chain $X_t$, there exists
a non-negative and measurable function $V$ (called a Lyapounov function), and
constants $c > 0$ and $0 < \rho < 1$ such that:

$$E(V(X_{t+1}) \mid X_t = x) \le \rho V(x) - c \qquad (7)$$

Then $X_t$ is a geometrically ergodic process.

Inequality (7) shows why $V(\cdot)$ is called a Lyapounov function and why the
notion of geometric ergodicity parallels the concept of stochastic stability.

Next, the focus is given to the existing results on the Markov process of form
(1). Here, a theorem by Doukham [3], which presents a set of sufficient
conditions for geometric ergodicity of the process (1), is reviewed.

Theorem 3.2 (Doukham) Consider the process (1). Let:

$$X_t = (y_{t-k}, \ldots, y_{t-1}, u_{t-q+1}, u_{t-q+2}, \ldots, u_{t-d}) \qquad (8)$$

Assume that $X_{t,i}$ indicates the $i$th element of $X_t$. Also, assume the
following:

1. There exist a number $x_0 > 0$, non-negative constants
$\psi_1, \ldots, \psi_k$, a locally bounded measurable function
$h : \mathbb{R} \to \mathbb{R}^{+}$, and a positive constant $c$ such that
$\sup_{\|X_t\| \le x_0} |f(X_t)| < \infty$ (where $\|X_t\|$ is the Euclidean
norm of $X_t$), and

$$|f(X_t)| \le \sum_{j=1}^{k} \psi_j |X_{t,j}| + \sum_{j=k+1}^{q-d+k} h(X_{t,j}) - c \qquad (9)$$

if $\|X_t\| > x_0$.

2. $E[|\zeta_1|] + (q - d) E[h(u_1)] < c < \infty$.

Then, if the unique non-negative real zero of the "characteristic polynomial"
$P(z) = z^k - \psi_1 z^{k-1} - \ldots - \psi_k$ is smaller than one, the process
is geometrically ergodic. Moreover, if the process $X$ is stationary, the
process "y" (i.e. $\{y_t\}_{t=-\infty}^{\infty}$) is geometrically
$\alpha$-mixing.

Although the details of the long proof presented for this theorem (by Doukham)
are not repeated here, a brief description of the general scheme of the proof
will be given. The proof starts by introducing a Lyapounov function of the form:

$$V(X_t) = \sum_{j=1}^{k} \alpha_j |X_{t,j}| + \sum_{j=k+1}^{q-d+k} \beta_j h(X_{t,j}) \qquad (10)$$

Then the appropriate choices of $\alpha_j$'s and $\beta_j$'s to satisfy (7) are
investigated. It is then proved that if all the assumptions made in the theorem
hold, the condition set on the zeros of the characteristic polynomial $P(z)$
guarantees geometric ergodicity. As can be seen, the sufficient conditions set
in Theorem 3.2 guarantee the stochastic Lyapounov stability as well as the
geometric ergodicity of the model.

The following theorem uses the results of Theorem 3.2 to present a set of
sufficient conditions for stochastic stability and geometric ergodicity of the
families of sigmoid neural networks discussed above. These conditions are set on
the known parameters of the network, and as a result can be easily tested during
a practical modeling task. A family of atan sigmoid networks is first
considered:



Theorem 3.3 (Najarian Let: 

Af = (y^_/j;, Ut-q-\-2t • ■ ■ — ) 

Take yt, Ct ut as defined in m- Also assuTTie that f is a sigmoid neural 
network with I neurons of the following general form: 

fii^) = f ELi (bix) 

Also assume: x = Xt where: p = q — d + k. Further assume that if[|Ct|] < 
and A[|ut|] < M„. Define: 



^ 2 

Wj = V -|a,||6y| (11) 

^ ' 7T 
i=l 

where $j = 1, \dots, k$. Suppose: $\cdots = \max_j \omega_j$. Also define the
following characteristic polynomial:
$P(z) = z^k - \omega_1 z^{k-1} - \dots - \omega_k$. Then the sequence $X_t$ is
geometrically ergodic if the unique non-negative real zero of P(z) is smaller
than one. Also, if $X_t$ is stationary then “y” is geometrically α-mixing.
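
The root condition of the theorem is easy to test numerically. The following is only a minimal sketch, assuming that Eq. (11) reads ω_j = (2/π) Σ_i |a_i| |b_ij| and that P(z) = z^k - ω_1 z^{k-1} - ... - ω_k; the function names and the toy weights are illustrative and do not come from the original work. Because all ω_j are non-negative, P(z)/z^k is increasing for z > 0, so a plain bisection finds the unique positive zero.

```python
import math
from typing import List, Sequence


def omega_from_weights(a: Sequence[float], b: Sequence[Sequence[float]]) -> List[float]:
    """omega_j = (2/pi) * sum_i |a_i| * |b_ij| (assumed reading of Eq. 11)."""
    k = len(b[0])
    return [
        (2.0 / math.pi) * sum(abs(ai) * abs(bi[j]) for ai, bi in zip(a, b))
        for j in range(k)
    ]


def char_poly(omega: Sequence[float], z: float) -> float:
    """P(z) = z^k - omega_1 z^(k-1) - ... - omega_k."""
    k = len(omega)
    return z ** k - sum(w * z ** (k - 1 - j) for j, w in enumerate(omega))


def positive_real_root(omega: Sequence[float], iterations: int = 200) -> float:
    """Bisection for the unique positive zero of P (requires omega_j >= 0)."""
    if not any(w > 0 for w in omega):
        return 0.0
    lo, hi = 0.0, 1.0 + max(omega)  # Cauchy bound: P(hi) > 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if char_poly(omega, mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical one-neuron network with two lagged outputs (k = 2).
omega = omega_from_weights([1.0], [[math.pi / 4, math.pi / 8]])
root = positive_real_root(omega)
is_ergodic = root < 1.0
```

Under the same assumptions, the root is smaller than one exactly when the ω_j sum to less than one, which gives a quick sanity check on the bisection result.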

The proof of this theorem, being too long (see [HI), is not given here, however 
the general sketch of the proof is presented. The proof follows the method
introduced by Doukhan (as described above) and starts with defining a
Lyapounov function based on the weights of the neural network. Then, the
conditions are found such that the introduced Lyapounov function guarantees
geometric ergodicity.

Having dealt with the conditions for stochastic stability and α-mixing of
SNN’s, one can move on to the next step, which is applying the PAC learning
scheme to a task of neural ARX modeling.



Theorem 3.4 (Najarian Consider a sigmoid neural network as defined 
above. 

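Let $\mathcal{F}_l$ be a set of sigmoid neural networks with $l$ neurons. Also,
define:

$$h = \cdots \qquad (12)$$

For a vector $w = (w_1, w_2, \dots, w_d)$, use the notation $|w|_1$ as:
$|w|_1 = \sum_{i=1}^{d} |w_i|$.

Assume that: $|b_i|_1 \le \Gamma_i$ and $\sum_{i=1}^{l} |a_i| \le C$.

Then, the empirical risk minimization algorithm provides PAC learning of
$\mathcal{F}_l$ with geometrically α-mixing data, i.e. for any ε and δ there
exists n such that:

$$\sup_{f \in \mathcal{F}_l} \Pr\{d_P(f, h) < \epsilon\} \ge 1 - \left(\frac{2e(4C+\epsilon)}{\epsilon}\right)^{l} \left[\prod_{k=1}^{l} \left(\frac{2e(6\Gamma_k C+\epsilon)}{\epsilon}\right)^{d}\right] (1 + 4e^{-2}k_1) \exp\left[-\frac{\epsilon^{2} n \cdots}{64\,(2+\cdots)}\right]$$

or equivalently:

$$\delta \ge \left(\frac{2e(4C+\epsilon)}{\epsilon}\right)^{l} \left[\prod_{k=1}^{l} \left(\frac{2e(6\Gamma_k C+\epsilon)}{\epsilon}\right)^{d}\right] (1 + 4e^{-2}k_1) \exp\left[-\frac{\epsilon^{2} n \cdots}{64\,(2+\cdots)}\right] \qquad (13)$$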

The proof of this theorem is also too long to be mentioned here and can be 
found in [S]; however the general sketch of the proof given here might be insight- 
ful. The proof (inspired by Barron’s work on the covering sets of neural networks 
d!) starts with introducing a constructive method of forming a
distribution-free cover of the function set $\mathcal{F}_l$. Then, the
statistical properties of $d_P(f, h)$ for each $f$ are related to the α-mixing
properties of the data. Finally, the size of the cover set and the statistical
properties of $d_P(f, h)$ are combined to create the resulting inequality. It
is important to notice that the proof is a constructive one and gives a
systematic method to perform the empirical risk minimization algorithm.

A brief glance at the results of the above theorem reveals that the introduced
bounds are highly conservative. Also, in many real applications, the actual
values of $k_1$, $k_2$ and $k_3$ may not be known. The direct use of the above
bounds may fit applications where huge training data sets, along with some
information on the statistical nature of the data, are available. In some
applications, one may first use the data to estimate $k_1$, $k_2$ and $k_3$,
and then apply the above bounds. However, in many applications, the available
training sets are small, and performing another estimation task merely to find
$k_1$, $k_2$ and $k_3$ may not be desirable. In the next section, the obtained
results are used to define complexity terms to be incorporated into the cost
functions.

One can easily extend the results of the above learning scheme to a more 
general paradigm of model-free learning m- This would enable us to deal with 
cases where data is noisy or the system that generates the data is not a neural 
network of known structure. 

4 Minimum Complexity ARX Neural Modeling 

In this section, new complexity terms are defined based on the functional form
of the available bound for δ (or more specifically ln(δ)). The cost function is
then defined as the linear combination of the empirical error and the
complexity term. Notice that δ (or ln(δ)) describes the degree of uncertainty
over the accuracy of the model. Here, by defining the complexity term as ln(δ)
and incorporating that into the cost function, the uncertainty over the model
is minimized throughout the training procedure. Now notice that Inequality (13)
gives the following bound on ln(δ):



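$$\ln(\delta) \ge l \ln\left(\frac{2e(4C+\epsilon)}{\epsilon}\right) + d \sum_{k=1}^{l} \ln\left(\frac{2e(6\Gamma_k C+\epsilon)}{\epsilon}\right) + \ln(1 + 4e^{-2}k_1) - \frac{\epsilon^{2} n \cdots}{64\,(2+\cdots)} \qquad (14)$$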

A brief look at Inequality (14) shows that since $k_1$, $k_2$ and $k_3$ are
often unknown (as described above), one cannot define a complexity term that
encompasses all statistical properties of the modeling procedure accurately.
Since the value of $k_1$ is not available, one can exclude the term
$\ln(1 + 4e^{-2}k_1)$ from the complexity term. Although by omitting this term
some of the statistical characteristics of the data are disregarded, making
assumptions on a variable that is unknown is avoided. As to the value of n,
things are more complicated.
Here, some choices for $k_2$ and $k_3$ are assumed that seem to be reasonable
for a typical modeling application; however, these choices are by no means
meant to be the best possible selections and can be replaced by any better
estimations or assumptions. It has to be stressed again that the direct
estimation of these values from the training data might give a better set of
values for $k_2$ and $k_3$, but since this would involve another estimation
problem, this approach is not used here. In choosing the value of $k_2$, one
can simply set $k_2 = 1$. This is because in most applications, the decrease in
correlation of data is not too fast. As to $k_3$, this value is selected to be
of the form $k_3 = 1/k$, where $k$ is the degree of the system
(as defined in &)■ This way, one can ensure that the correlation decreases more 
slowly when the degree of the system is higher. Also, notice that the defined
complexity term depends on the values of $C$ and $\Gamma_k$. Since in a
practical modeling procedure, the values of these variables are not known
beforehand, this might make the defined complexity term less practical. Notice
that $C$ and $\Gamma_k$ are merely bounds on the size of the network weights.
Therefore, in order to obtain a more practical complexity measure, one can
replace $C$ and $\Gamma_k$ with $\sum_{i=1}^{l} |a_i|$ and $|b_k|_1$,
respectively. This gives the following practical complexity measure:



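$$C_{snn} = l \ln\left(\frac{2e\left(4\sum_{i=1}^{l}|a_i| + \epsilon\right)}{\epsilon}\right) + d \sum_{k=1}^{l} \ln\left(\frac{2e\left(6\,|b_k|_1 \sum_{i=1}^{l}|a_i| + \epsilon\right)}{\epsilon}\right) \qquad (15)$$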



Based on the above complexity term, a cost function as follows can be defined: 



$$J_{snn} = \sum_{i=1}^{n} (\cdots) + \lambda\, C_{snn} \qquad (16)$$

Higher values of λ may give close testing and training errors that are both
too large, while smaller values of λ may result in a small training error and
a large testing one. The choice of λ should be made according to the objectives
of the specific application at hand. A practical approach would be starting
with very small values of λ and performing the optimization process. If the
empirical error is small, then the value of λ can be increased and the
optimization can be performed with the new λ. This cycle can be repeated (i.e.
λ is increased) as long as the empirical error is still small enough. As soon
as the empirical error starts to become undesirably large, the last value of λ
can be used as an appropriate choice of λ.
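
The cycle described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `train_and_score` is a hypothetical callback that runs the optimization for a given λ and returns the resulting empirical error, `max_error` is the application-specific threshold for "small enough", and "the last value of λ" is read here as the last value whose error was still acceptable.

```python
from typing import Callable, Iterable, Optional


def select_lambda(
    train_and_score: Callable[[float], float],
    candidates: Iterable[float],
    max_error: float,
) -> Optional[float]:
    """Increase lambda step by step while the empirical error stays acceptable.

    Returns the last acceptable lambda, or None if even the smallest
    candidate already gives an error above max_error.
    """
    chosen = None
    for lam in sorted(candidates):
        error = train_and_score(lam)
        if error > max_error:
            break  # the empirical error has become undesirably large
        chosen = lam
    return chosen


# Toy stand-in for a real training run: the error grows with lambda.
def toy_error(lam: float) -> float:
    return 0.01 + 0.5 * lam


best = select_lambda(toy_error, [0.001, 0.01, 0.1, 1.0], max_error=0.1)
```

In a real setting each call to the callback is a full optimization run, so a coarse geometric grid of candidate values keeps the cost manageable.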
In forming the above cost function, it is assumed that the process is
geometrically α-mixing. However, as mentioned before, this property cannot be
assumed easily and must be checked. If the output process is known to be
stationary, then the results given in this paper show that if the positive
real root of the characteristic equation is less than one, the process is
guaranteed to be geometrically ergodic, stochastically stable and geometrically
α-mixing. In order to ensure that the positive real root is indeed less than
one, one can extend the optimization process and try to minimize the magnitude
of this root throughout the training process. In order to do so, a new term can
be added to the cost function, i.e. assuming that PRR (positive real root)
represents this root and γ > 0, a new cost function can be suggested as
follows:



The value γ describes how important it is for us to make sure that the model
is actually stable and geometrically α-mixing. It can be imagined that in many
real applications, the engineers are not particularly concerned about verifying
the geometrically α-mixing condition, and this property can be assumed without
checking. However, in many applications, the system to be modeled is known to
be stable and it is often desirable to obtain “stable models” for “stable
systems”. Then, one has to choose γ large enough to make sure that PRR is
actually less than one. Having a stable neural network that accurately models a
stable system is considered to be a major objective in dynamic neural modeling.

In order to minimize the complex cost function presented above, one needs 
to use an optimization algorithm that can handle nonlinear and non-smooth 
cost functions. Algorithms based on Evolutionary Programming are known to 
be successful in handling such optimization processes (see jO] as an example). 

5 Conclusions 

In this paper, the conventional PAC learning has been extended to the learn- 
ing with strong mixing data. Using this framework, the learning properties of 
neural ARX modeling of complex dynamic models have been addressed. Also, 
a sufficient condition for the stability of the resulting neural network is intro- 
duced. Finally, based on the obtained results, a cost function is introduced that 
creates a balance between the empirical error and the complexity of the model 
and guarantees the stability of the resulting network. 
Abstract. This article presents the notion of qualitative descriptor, a
theoretical tool which describes within the same formalism different
approaches to transform quantitative data into qualitative data. This
formalism is used with a grouping algorithm to extract qualitative phases
from a data flow. Work on action perception, based on qualitative
descriptors, is used to illustrate these ideas. The grouping algorithm
generates a qualitative symbolic data flow from a video sequence. The ultimate
aim is to provide an unsupervised learning algorithm working on this
qualitative flow to extract abstract descriptions for common actions such as
“take”, “push” and “pull”.



1 Introduction 

Perception is probably one of the most challenging problems in Artificial Intel- 
ligence. Even if we could design a smart system to compute difficult and subtle 
reasoning inferences, we still need a strong perception capability for the system 
to be able to learn and to be of some use in the real world. Pattern recognition 
mm, artificial vision or active vision [3] are some well known research areas that
address this problem. 

In this work, we focus our attention on symbolic representation, which means 
that we will not discuss the approaches based on connectionist, non-symbolic 
views. We assume that the aim of the system is to generate a complete or
partial symbolic representation of the world it perceives. In 1998, Stuart
Russell proposed a challenge for Artificial Intelligence: building a system
capable of driving a car from Palo Alto to San Francisco. His point of view was
to use
only numerical, non symbolic information with no internal representation of the 
problem. This contradicts the usual trends of classical AI, which are mainly con- 
cerned with symbolic data. The approach we propose lies somewhere between 
these two views, as we use symbolic data flows computed from numerical data 
flows. In fact, we believe the whole perception problem starts with this well- 
known transition from numerical data to symbolic and qualitative data. This 
will be the main scope of the article. 
2 Description of the Problem

In computational terms, our problem is to design a transition system that allows 
us to transform quantitative data into qualitative data. But we want a little more 
than just a translation from numeric to symbolic. We also want, as we said in 
the previous section, a system capable of extracting qualitative properties of the 
data and relations between them. Our system should be designed as follows: 

Input: raw data flow, most of the time numerical (for example, images, 
numerical flows), but not only. 

Output: a qualitative flow and a set of qualitative properties describing 
certain qualitative regularities in the flow. 

Constraint: real-time processing: the qualitative output should be available
at the same pace as the raw data flow input.

Properties: a modular system, each qualitative property being described by 
a single qualitative descriptor. Because they are reusable, the set of qualitative 
descriptors is a kind of toolbox. 

Qualitative descriptors are the tools we use to build such a system. 

3 Qualitative Descriptors 

A qualitative descriptor can be seen as a qualitative processor, described by a
quadruplet (D, Q, T, G). It receives a flow of data as an input and produces a
flow of qualitative, symbolic data as an output. The type of input data will be 
called the domain D of the descriptor. In practice, this domain can be anything, 
not only a numerical domain. For example, the domain of a qualitative descriptor 
can be the output of another qualitative descriptor. This leads to networks of 
interconnected descriptors, as we shall see below. 

A qualitative transform function T is used to change the data of D into
qualitative data of Q, the qualitative description domain. The elements of Q
will be called qualitative types. The important idea is that there is a
generalization function G from Q × Q to Q in order to generalize two
qualitative types of Q. This particularity is used to detect regularities and
common points inside the qualitative flow.

Usually, Q will be a set of first order predicates, for it defines a natural 
generalization function G. 

3.1 Order of Descriptors 

The notion of descriptor order is very important. Some descriptors transform one 
element of D into one of Q. This is a simple function that abstracts information, 
according to a specific qualitative perspective. It will be called a qualitative 
descriptor with order 1. For example, a very simple first order descriptor could 
be working on the sign of a real number, which is a qualitative feature of this 
number. If x < 0 then T(x) = [neg] else T(x) = [pos]. Note that we adopt the
convention of writing elements of Q between square brackets.

Some descriptors need two elements of D to produce a qualitative response 
in Q. The typical example is a descriptor stating whether two numbers are in 
decreasing or increasing order. If x > y then T(x, y) = [>] else T(x, y) = [<].
They are called second order descriptors. 

In a way, the order of a descriptor can be seen as the arity of the function 
T. First order descriptors just transform one piece of information into another. 
Second order descriptors give information about the evolution of some elements 
of D. One could speak of static and dynamic descriptors and the notion of order 
is then closely related to the notion of derivative order in mathematics. 

3.2 Examples 

Here are some typical examples of qualitative descriptors. The first and second 
ones are not directly used in the action perception problem but can be useful 
in other situations (see [5]). The third is used in the action perception problem
and is quite typical too since it uses predicates as an output. 

Alignment Descriptor For this qualitative descriptor, the domain is the set
of points in the plane. It is a second order descriptor, and therefore it needs
two points as an input. The qualitative result given by the function T is a
description of the line going through the two points. The qualitative
representation of a line (i.e. an element of Q) is a couple made up of a
direction vector and one point of the line, such as [D(v, P)]. Two lines can be
generalized into a line, a vectorial line or nothing, as follows:

G(D(v1, P1), D(v1, P1)) = D(v1, P1)
G(D(v1, P1), D(v1, P2)) = D(v1, *)
G(D(v1, P1), D(v2, P2)) = (nothing)

Note that we have not expressed how points are described for this descriptor.
In any practical implementation, a Cartesian representation, using a given
reference, will be appropriate. In this case, the domain becomes ℝ², but this
is not necessary.

Monotonicity Descriptor This descriptor operates on number flows. Its domain
is ℝ. It is a second order descriptor which describes the monotonicity of a
series of numbers. The function T is:

if x > y, T(x, y) = [dec]
if x < y, T(x, y) = [inc]
if x = y, T(x, y) = [equal(x)]

As a generalization, we could propose G([equal(x)], [equal(y)]) = [equal(*)].
In other cases, G(X, Y) = [*] if X ≠ Y and G(X, X) = X. This could allow the
system to recognize a staircase function for example.
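
To make the (D, Q, T, G) formalism concrete, the monotonicity descriptor can be written down in a few lines. This is a minimal sketch, not the authors' implementation: the `Descriptor` class, the string and tuple encodings of the qualitative types, and the sample flow are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

STAR = "*"  # wildcard standing for the generalized value


@dataclass(frozen=True)
class Descriptor:
    """A qualitative descriptor: order (arity of T), transform T, generalization G."""
    order: int
    T: Callable[..., Any]
    G: Callable[[Any, Any], Any]


def mono_T(x: float, y: float) -> Any:
    """Qualitative type of two consecutive numbers."""
    if x > y:
        return "dec"
    if x < y:
        return "inc"
    return ("equal", x)


def mono_G(X: Any, Y: Any) -> Any:
    """Generalize two qualitative types as described in the text."""
    if X == Y:
        return X
    both_equal = isinstance(X, tuple) and isinstance(Y, tuple)
    return ("equal", STAR) if both_equal else STAR


monotonicity = Descriptor(order=2, T=mono_T, G=mono_G)

# Apply T to each pair of consecutive elements of a small number flow.
flow = [1, 2, 2, 1]
types = [monotonicity.T(a, b) for a, b in zip(flow, flow[1:])]
```

Keeping T and G as plain functions inside one record mirrors the toolbox idea: a new descriptor is just another instance with a different domain, transform and generalization.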
Move Descriptor This qualitative descriptor is given a speed as an input.
The domain is then a vector space. If the magnitude of the speed is not zero,
within a certain tolerance, the qualitative result given by T is a predicate
move. The generalization is the natural generalization used on predicates. In
practice, the input vector is associated with a label naming the object whose
speed is given. Here, we call it a for example.

if |v| > ε, T(v) = [move(a)]
if |v| ≤ ε, T(v) = []

3.3 Grouping Algorithm 

A grouping algorithm has been designed (see i) to use the generalization capa- 
bilities of qualitative descriptors. This algorithm works in real time to provide 
a qualitative flow of elements of Q from a flow of elements of D, given a
specific qualitative descriptor. To illustrate the results of this algorithm,
we will use a simple example. Let us consider a flow of couples (a, b) where a
and b are numbers. We will use the algorithm together with a monotonicity
descriptor operating on both elements of the couple. The following figure
describes the expected result when running the algorithm.
The labeled arrows on the left are qualitative groupings generated by the
algorithm. They represent maximal groupings in the sense that it is impossible 
to extend them without changing their qualitative type. The algorithm generates 
every maximal grouping, according to the generalization specification of the 
qualitative descriptor. 



Short Description Here is a short description of the algorithm. We consider a
flow of elements of D, analyzed by a given descriptor, with order n. The
elements of the flow are given in real time to the algorithm, one by one. In
the following, we shall call grouping a series of consecutive elements of the
flow, bounded by the rank of the first and the last element. A grouping is also
associated with a qualitative type, i.e. an element of Q, written QT(g) for a
grouping g.

Two sets of groupings are used in this algorithm: RES and active. Initially,
they are empty sets. As the algorithm is run, RES will be filled with the
groupings already built (i.e. whose final rank is strictly smaller than the
rank currently being processed). The groupings in RES are closed groupings. In
active, we find the groupings currently being built.

To see how it works, let us suppose the algorithm has already started and is 
in progress: 



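[Figure: state of the algorithm at the current rank examined (example with
n=2), showing the closed groupings in RES, the active groupings a1 and a2, and
the elements prev_1, ..., prev_n preceding the current element.]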



The RES list has three groupings. The active list contains a1 and a2. e is the
current flow element that has to be integrated, and prev_k is the element
located k places to the left of e (prev_0 is e itself, prev_1 is the previous
element, added just before e). First, the algorithm builds a small grouping, h,
whose extension ranges from prev_n to e and with qualitative type
T(prev_n, prev_{n-1}, ..., prev_1, e). Then, for each grouping a in active, a
is extended to include e if QT(a) = QT(h). If not, then a new grouping is
created, with the extension of a plus e and the qualitative type of a
generalized with the qualitative type of h, according to G. This new grouping
is called g. If QT(a) = QT(g), a is replaced by g in the active list. Else, a
moves from active to RES since it cannot be extended without changing its
qualitative type. If there is no grouping in active with the same qualitative
type as g, then g is to be added to active. We proceed through the entire
active list in this way. If there is no grouping a such that QT(a) = QT(h), h
is added to active.

At the end, the algorithm just copies the groupings of active into RES and 
then stops. 
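
The steps above can be sketched as follows. This is a hedged reading of the description, not the authors' code, and it makes three assumptions that should be kept in mind: the small grouping h is built from the last n elements (so that the order equals the arity of T), G returns None when two types have no useful generalization (so that no catch-all grouping is created), and the `Grouping` class and the toy monotonicity functions are illustrative names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Hashable, List, Optional, Sequence


@dataclass
class Grouping:
    start: int       # rank of the first element covered
    end: int         # rank of the last element covered
    qtype: Hashable  # qualitative type QT(g)


def group_flow(
    flow: Sequence[Any],
    order: int,
    T: Callable[..., Hashable],
    G: Callable[[Hashable, Hashable], Optional[Hashable]],
) -> List[Grouping]:
    res: List[Grouping] = []     # closed groupings
    active: List[Grouping] = []  # groupings still being built
    for rank in range(order - 1, len(flow)):
        window = flow[rank - order + 1 : rank + 1]
        h = Grouping(rank - order + 1, rank, T(*window))
        kept: List[Grouping] = []
        for a in active:
            if a.qtype == h.qtype:
                kept.append(Grouping(a.start, rank, a.qtype))  # extend a
                continue
            g_type = G(a.qtype, h.qtype)
            if g_type is not None and g_type == a.qtype:
                kept.append(Grouping(a.start, rank, g_type))   # a replaced by g
                continue
            res.append(a)                                      # a is closed
            known = {x.qtype for x in active} | {x.qtype for x in kept}
            if g_type is not None and g_type not in known:
                kept.append(Grouping(a.start, rank, g_type))   # g becomes active
        if all(x.qtype != h.qtype for x in kept):
            kept.append(h)
        active = kept
    res.extend(active)  # at the end, copy the active groupings into RES
    return res


def mono_T(x: float, y: float) -> Hashable:
    if x > y:
        return "dec"
    if x < y:
        return "inc"
    return ("equal", x)


def mono_G(X: Hashable, Y: Hashable) -> Optional[Hashable]:
    if X == Y:
        return X
    if isinstance(X, tuple) and isinstance(Y, tuple):
        return ("equal", "*")
    return None  # assumption: no generalization across different kinds


result = group_flow([1, 2, 3, 3, 3, 2], 2, mono_T, mono_G)
summary = [(g.start, g.end, g.qtype) for g in result]
```

On this toy flow the sketch yields one increasing grouping over ranks 0 to 2, one constant grouping over ranks 2 to 4 and one decreasing grouping over ranks 4 to 5; adjacent groupings share their boundary element because a second order type describes a pair of elements.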

4 Application 

4.1 Problem 

Our main application is related to action perception. We give the system a series 
of video sequences showing simple basic actions like “take”, “push” or “pull”. 
This is unsupervised learning since the sequences are not labeled and, for this 
reason, we have here a different point of view from I3E1IZ]. To be able to learn 
from video sequences, there are two main stages: first to create a qualitative
flow from the image flow and then to extract regular patterns from this
qualitative flow. In this article we will focus mainly on the first part,
although we shall briefly present an outline of the second part at the end.
Note that our concern is not really image segmentation or artificial
vision. Therefore, the objects in the video sequences are very easy to
recognize, clearly colored on a black background. This simplification allows
the use of simple, yet robust algorithms to do the image segmentation.



4.2 Qualitative Descriptors Used 

As already mentioned, qualitative descriptors can be connected one to the other. 
The figure below presents the structure we use for the qualitative analyzer: 




The qualitative descriptors used are described below, and their Input, Output
and Transfer functions are given in detail:



D1: Extend Matrix Input: the raw image flow.

Output: an occupation grid where every pixel of the image is associated with
a number corresponding to the label of an object.

Transfer function: image segmentation. The descriptor extracts the number 
of objects and their spatial extension. Note that the image is made up of clearly 
colored objects that are easy to distinguish. A simple region-growing algorithm 
is used. 



D2: Touch Input: an occupation grid from D1.

Output: a predicate indicating what objects are touching. The objects are
named with the labels from the occupation grid. The predicate is touch.

Transfer function: uses the occupation grid to detect two touching points 
(with tolerance) with different labels. 



D3: Gravity Input: an occupation grid from D1.

Output: a set of positions corresponding to the gravity centers of the objects.

Transfer function: calculates the isobarycenter of the set of all points of a
given label.
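
As an illustration of the D3 transfer function, the isobarycenter of each labeled region is just the mean of its pixel coordinates. The sketch below assumes the occupation grid is a list of rows of integer labels and that label 0 marks the background; both conventions are assumptions, not details given in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def gravity_centers(
    grid: List[List[int]], background: int = 0
) -> Dict[int, Tuple[float, float]]:
    """Return {label: (mean_row, mean_col)} for every non-background label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # row sum, column sum, pixel count
    for r, row in enumerate(grid):
        for c, label in enumerate(row):
            if label != background:
                acc = sums[label]
                acc[0] += r
                acc[1] += c
                acc[2] += 1
    return {label: (sr / n, sc / n) for label, (sr, sc, n) in sums.items()}


# Toy occupation grid with two objects labeled 1 and 2.
grid = [
    [0, 1, 1],
    [0, 1, 1],
    [2, 0, 0],
]
centers = gravity_centers(grid)
```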
D4: Speed Input: a set of positions from D3.

Output: a set of vectors corresponding to the instantaneous speed of the
objects. This descriptor is a second order one.

Transfer function: simple discrete derivative. 



D5: Follow Input: both the gravity centers and the speeds of the objects. This 
is an example of a more complex domain. 

Output: a predicate indicating which objects are following or avoiding each 
other. Predicates: follow, avoid. 

Transfer function: an object A is following an object B if its speed or the 
variation of its speed is directed in a cone pointing to the center of gravity of B. 



D6: Attract Input: a set of positions from D3. 

Output: a predicate indicating which objects are attracting, repulsing or 
equidistant. Predicates: attract, repulse, equid. 

Transfer function: an object A is attracting an object B if their relative 
distance is decreasing. 



D7: Move Input: a set of speeds from D4. 

Output: a predicate indicating which objects are moving or immobile. Pred- 
icates: move, immobile. 

Transfer function: compares the magnitude of the speed with zero, with a
tolerance ε.
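
The transfer functions of D4, D6 and D7 are simple enough to sketch directly. The following is illustrative only: positions are assumed to be 2-D tuples, the time step `dt` and the tolerance `eps` are free parameters, and applying a tolerance to the distance comparison in D6 is an assumption (the text only states it explicitly for D7).

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def speed(prev_pos: Point, cur_pos: Point, dt: float = 1.0) -> Point:
    """D4: simple discrete derivative of the position."""
    return ((cur_pos[0] - prev_pos[0]) / dt, (cur_pos[1] - prev_pos[1]) / dt)


def move_type(label: str, v: Point, eps: float) -> str:
    """D7: compare the magnitude of the speed with zero, with tolerance eps."""
    return f"move({label})" if math.hypot(v[0], v[1]) > eps else f"immobile({label})"


def attract_type(a: str, b: str, prev_a: Point, prev_b: Point,
                 cur_a: Point, cur_b: Point, eps: float) -> str:
    """D6: attract if the relative distance decreases, repulse if it grows."""
    d_prev = math.dist(prev_a, prev_b)
    d_cur = math.dist(cur_a, cur_b)
    if d_cur < d_prev - eps:
        return f"attract({a},{b})"
    if d_cur > d_prev + eps:
        return f"repulse({a},{b})"
    return f"equid({a},{b})"


# Object A moves toward an immobile object B between two frames.
v_a = speed((0.0, 0.0), (3.0, 4.0))
v_b = speed((10.0, 0.0), (10.0, 0.0))
types = (
    move_type("A", v_a, eps=0.5),
    move_type("B", v_b, eps=0.5),
    attract_type("A", "B", (0.0, 0.0), (10.0, 0.0), (3.0, 0.0), (10.0, 0.0), eps=0.5),
)
```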



4.3 Tracks 

With this structure of qualitative descriptors, we only generate qualitative flows. 
To benefit from the generalization capabilities of the qualitative descriptors, we 
should use the grouping algorithm with the qualitative flow. To connect the 
algorithm to this structure of qualitative descriptors, we can plug some tracks 
on to the output of any descriptor. A track records the qualitative flow
produced by the descriptor; the grouping algorithm then works on it to produce
useful qualitative groupings according to the generalization function of the
descriptor.



5 Results 



The grouping algorithm and the qualitative descriptor structure have been im- 
plemented on a Pentium 400MHz, with a Sony EVI-D31 camera for the video 
sequences. 
Here is a typical sequence given to the system:




Only one image out of five is shown here; the complete sequence is made up
of 80 images.

We plugged a track on D2, D6 and D7 to get the following qualitative group- 
ings: 



SEQUENCE 1

[Figure: qualitative groupings per track, listed in order of appearance.]

D2: touch(B, floor); touch(A, B)
D6: attract(A, B), equid(A, B); equid(B, floor), repulse(B, floor)
D7: immobile(B), move(B); move(A), immobile(A), move(A)



The names of the predicates are explicit enough, and we can recognize a 
typical “take” sequence. The objects have been given explicit labels, so we have 
here an object A taking an object B left on the floor. 

Other interesting sequences are “push” and “pull”. To be able to distinguish
between these two sequences, we have to plug a track on D5:

SEQUENCE 2 (push)

[Figure: qualitative groupings per track, listed in order of appearance.]

D2: touch(B, floor); touch(A, B)
D6: attract(A, B), equid(A, B); equid(B, floor)
D7: immobile(B), move(B); move(A), immobile(A), move(A)
D5: follow(A, B)
SEQUENCE 3 (pull)

[Figure: qualitative groupings per track, listed in order of appearance.]

D2: touch(B, floor); touch(A, B)
D6: attract(A, B), equid(A, B); equid(B, floor)
D7: immobile(B), move(B); move(A), immobile(A), move(A)
D5: follow(B, A)



As we can see, the only difference between these actions is the fact that A 
follows B or the contrary. 



6 Extracting Abstract Qualitative Patterns 

6.1 Principle 

The next part of our work will be to extract some regular patterns from the
qualitative flow. In the previous example, the case of “take” could
be generalized by:

TAKE (generalization)

[Figure: generalized qualitative groupings per track.]

D2: touch(X, floor)
D6: attract(Y, X); equid(X, floor), repulse(X, floor)
D7: immobile(X); immobile(Y), move_seq(Y, Z)



The difficulty lies in the fact that this qualitative sequence is lost in a wider 
qualitative flow. Therefore, we have to extract this sub-sequence from the global 
flow. This is indeed the well-known problem of pattern extraction (see BW- 
To do so, we can use a simple algorithm with complexity O(p²), p being the
number of elementary sub-sequences to be compared one to the other. There are
other algorithms, working in O(p) but with a proportionality factor a that is
exponential in the size of the predicates. They are not necessarily better for
our particular problem of action perception, since the factor a may be greater
than p.

7 Conclusion 

The notion of qualitative descriptor allows the formalization of the transition 
from numerical data flow to qualitative flow. This tool is general enough to be 
used in any situation and yet can be processed by the same algorithm in all cases.
It creates qualitative flow and qualitative groupings, using the generalization 
capabilities of the descriptors. It has been used successfully in the example of 
action perception and produces a qualitative flow in which it seems possible to 
recognize typical “take” or “push” sequences, with simple pattern extraction 
algorithms. 

Modularity is also an advantage for the clarity and the ease of use of the 
qualitative notions involved. 

Many points are still to be studied, both in practical and theoretical terms. 
One very interesting question is the ability to learn new descriptors. Such learn- 
ing could be based on the building of macro descriptors: in the action perception 
problem, for example, once the system has identified a new action, it could create 
a descriptor for it. We would then have a “take” descriptor or a “push” descrip- 
tor. Another interesting question is to know how to choose the descriptors for 
a specific problem. The fact is that, given the type constraint on a description 
input, there are not many possibilities and therefore the right question would 
be to know if there is a “natural” set of elementary descriptors that could be 
used in any practical situation or if we have to define a new set for each
application. There is a good chance that a “natural” set of elementary descriptors can
be defined. This would lead to the building of a kind of qualitative descriptor 
toolbox, with possible cross references to cognitive psychology. Finally, we hope 
to build a system capable of creating its own abstraction from observation and 
if possible from interaction with its environment, reproducing thus part of the 
capabilities of a young baby. 
Abstract. Association rules discovered through attribute-oriented induction are
commonly used in data mining tools to express relationships between variables. 
However, causal inference algorithms discover more concise relationships 
between variables, namely, relations of direct cause. These algorithms produce 
regressive structured equation models for continuous linear data and Bayes 
networks for discrete data. This work compares the effectiveness of causal 
inference algorithms with association rule induction for discovering patterns in 
discrete data. 



1 Introduction 

Association rules discovered using attribute-oriented induction in tools such as 
DBMiner are used to express relationships among variables. However, causal 
inference algorithms discover deeper relationships, namely a variety of causal 
relationships including genuine causality, potential causality and spurious association 
[7,8]. In this paper, we describe and compare association rule generation based on 
their implementation in DBMiner [4] with Bayes net-based causal inference 
algorithms using Tetrad II [7], using a discretized contraceptive method choice 
(CMC) dataset from http://www.ics.uci.edu/~mlearn/MLRepository.html , the UCI 
Machine Learning Repository. 



2 Background 

2.1 Association Rule Generation 

Given a set of variables X, association rules describe relationships between variables 
in X. For example, if A and B are variables with respective values a and b,
then an example of the structure of an association rule is:

Given A = a → B = b.
DBMiner [4] produces associations based on confidence and support. For the rule
above, the confidence level can be described as "given A = a, with what
frequency will B = b and A = a?" It can be defined mathematically as follows
[1]:

confidence(B → H) = support(B ∪ H) / support(B).

Support indicates the frequency with which the association rule is true for the
population being examined and can be defined as:

support(B → H) = frequency(B ∪ H) / total-count.
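
The two definitions can be checked on a small example. The sketch below is illustrative only: the records are invented toy data (not taken from the CMC dataset), a rule body or head is represented as a dictionary of attribute values, and the function names are hypothetical rather than part of DBMiner.

```python
from typing import Dict, List

Record = Dict[str, str]


def matches(record: Record, condition: Record) -> bool:
    """True if the record satisfies every attribute = value pair."""
    return all(record.get(key) == value for key, value in condition.items())


def support(records: List[Record], body: Record, head: Record) -> float:
    """frequency(B and H together) / total-count."""
    both = {**body, **head}
    return sum(matches(r, both) for r in records) / len(records)


def confidence(records: List[Record], body: Record, head: Record) -> float:
    """support(B and H together) / support(B)."""
    n_body = sum(matches(r, body) for r in records)
    n_both = sum(matches(r, {**body, **head}) for r in records)
    return n_both / n_body if n_body else 0.0


# Invented toy records, for illustration only.
records = [
    {"ContraceptiveUse": "2", "WifeReligion": "1", "MediaExposure": "0"},
    {"ContraceptiveUse": "2", "WifeReligion": "1", "MediaExposure": "0"},
    {"ContraceptiveUse": "2", "WifeReligion": "1", "MediaExposure": "1"},
    {"ContraceptiveUse": "1", "WifeReligion": "0", "MediaExposure": "0"},
]
body = {"ContraceptiveUse": "2", "WifeReligion": "1"}
head = {"MediaExposure": "0"}

rule_support = support(records, body, head)
rule_confidence = confidence(records, body, head)
```

Here the body holds for three of the four records and the body together with the head holds for two, so the support is 2/4 and the confidence is 2/3.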

Association rules are expressed as logical rules with measures of support and 
confidence. The following table illustrates examples of rules produced by DBMiner: 



BODY | HEAD | SUPPORT | CONFIDENCE
ContraceptiveUse(x) = '2' AND WifeReligion(x) = '1' | MediaExposure(x) = '0' | 16.836 | 96.498
ContraceptiveUse(x) = '2' AND WifeReligion(x) = '1' AND WifeWorks(x) = '1' | MediaExposure(x) = '0' | 12.695 | 96.891



Association rules are semantically weak: there are no guarantees that association rules 
imply any deeper relationships. Finding association rules is exploratory data mining: 
rules discovered must be evaluated with caution by a domain analyst. 

2.2 Causal Inference Algorithms 

In Tetrad II [8], the choice of algorithm depends on whether the data examined is 
causally sufficient for the population, that is, whether there exist unmeasured hidden 
or latent [8,9] causal variables outside of X that explain spurious associations between 
variables in X. 

If data is causally sufficient, the PC algorithm is used. Otherwise, the Fast Causal 
Inference (FCI) algorithm is used. The PC algorithm indicates when hidden common 
causes may be influencing the relationships in X. 

Instead of logical rules, the PC and FCI algorithms find different kinds of causal 
relationships between variables A and B in X, graphically represented as follows: 

• A — B, meaning in PC that either A causes B or B causes A, but the direction is 
indeterminate; or in FCI that the variables are associated but the causal nature of 
the association is undetermined; 

• A → B, meaning A (genuinely) causes B (common cause ruled out);

• A ↔ B, indicating a hidden common cause (genuine cause ruled out, and that FCI
should be used rather than PC);

• A o→ B, meaning A potentially causes B (common cause not ruled out).

The resulting graph represents a set of Markov equivalent Bayes networks. 
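
One way to see how the four edge kinds constrain the possible models is to record, for each kind, the set of explanations it leaves open. The sketch below does this in Python, following the FCI reading given in the legend of the paper's figures. The enum, the ASCII edge spellings and the `refines` helper are assumptions made for illustration and are not part of Tetrad II.

```python
from enum import Enum


class EdgeKind(Enum):
    ASSOCIATION = "A --- B"
    POTENTIAL_CAUSE = "A o-> B"
    HIDDEN_COMMON_CAUSE = "A <-> B"
    GENUINE_CAUSE = "A --> B"


A_CAUSES_B = "A causes B"
B_CAUSES_A = "B causes A"
HIDDEN = "A and B have a hidden common cause"

# Explanations each edge kind leaves open (FCI reading).
EXPLANATIONS = {
    EdgeKind.ASSOCIATION: {A_CAUSES_B, B_CAUSES_A, HIDDEN},
    EdgeKind.POTENTIAL_CAUSE: {A_CAUSES_B, HIDDEN},
    EdgeKind.HIDDEN_COMMON_CAUSE: {HIDDEN},
    EdgeKind.GENUINE_CAUSE: {A_CAUSES_B},
}


def refines(specific, general):
    """True if every explanation allowed by `specific` is allowed by `general`."""
    return EXPLANATIONS[specific] <= EXPLANATIONS[general]


print(refines(EdgeKind.HIDDEN_COMMON_CAUSE, EdgeKind.POTENTIAL_CAUSE))
print(refines(EdgeKind.POTENTIAL_CAUSE, EdgeKind.GENUINE_CAUSE))
```

Under this representation, a hidden common cause or a genuine cause is a more specific finding than a potential cause, which in turn is more specific than a bare association.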

The causal inference algorithm makes strict assumptions about the data: Variables 
in X must satisfy the Markov Condition; i.e., variables can be organized into a 
directed acyclic graph so that any variable A in X conditioned on A's parents is
independent of all sets of variables that do not include A or its descendants [8]. The
following factors affect whether or not the Markov Condition is satisfied by data: 

1. Values of one unit of the population must be independent of values in other units 
of the population [8]. If we are studying a contagious disease, then whether person 
B has the disease is not independent of whether person A has the disease. 

2. Mixtures in populations that result in contradictory causal connections between 
two variables violate the Markov Condition [8]. This occurs in the CMC example 
with contraceptive use (cu) and number of children (nc) [Figure 1]. For example, 
one woman may only have two children because she has previously been making 
use of contraceptives. Another woman with two children may start using 
contraceptives because she has already had as many children as she wants. 
Therefore, the direction of causation is different for these two women. 

3. Cyclical processes that reach equilibrium also violate the Markov condition [8]. 
Consider the sequence: in time t the value of A affects the value of B. Then in time 
t+1 the value of B affects the value of A. In time t+2 the value of A affects the
value of B. This circular relationship violates the Markov condition. Supply and 
demand variables in economics frequently violate this condition. 

4. The sample must be representative of the population [8]. 

The first and last restrictions apply to any statistical technique. Methodology to 
handle (feedback) cycles in Bayes nets using temporal variables has developed to the 
point where it has been successfully applied to speech recognition problems [12]. 
However, Tetrad does allow the user to incorporate background knowledge such as 
temporal ordering [8]. In the CMC example (described below), we incorporated the 
constraint that a woman's age cannot be caused by her standard of living or work. If 
this background knowledge is not correct, results produced by the causal inference 
algorithms will not be correct. 

If continuous variables have been transformed into discrete variables, statistical 
tests of conditional independence may be incorrect [8]. However, with current 
technology, it is necessary to discretize continuous variables in order to analyze 
discrete and continuous variables together. (Wife's age was discretized for this study.) 
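
The discretization step can be sketched as equal-width binning. The Python below is an illustration only: the bin origin, the bin width and the half-open interval convention are assumptions, and the label format simply imitates interval values such as '20.00~30.00' that appear in the DBMiner rules.

```python
def discretize(value, lower, width):
    """Map a numeric value to the label of its equal-width bin.

    Bins are assumed half-open: [start, start + width).
    """
    index = int((value - lower) // width)
    start = lower + index * width
    return f"{start:.2f}~{start + width:.2f}"


# Hypothetical ages, binned in ten-year intervals starting at 0.
for age in (16.5, 27, 30, 49):
    print(age, "->", discretize(age, lower=0, width=10))
```

Any such choice of bins is a modelling decision, and, as noted above, it can affect the conditional independence tests that the causal inference algorithms depend on.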



3 Experiment with Contraceptive Method Choice (CMC) 

3.1 Design 

Tetrad II and DBMiner were applied to sample data examining contraceptive method 
choice in Indonesia, containing 1437 complete records consisting of eight discrete and 
two continuous variables (discretized for both applications). 

3.2 Experiment Using Association Rule Generation 

Using DBMiner, association rules were generated at different levels of support and 
confidence, as summarized in the following table. While too many rules were 
generated to enumerate here, a complete list is available from the authors.

Support   Confidence   # of Rules Generated

10%       90%          441
10%       95%          265
20%       90%          97
25%       95%          49

3.3 Experiment Using Causal Inference Algorithm 

Several variations of the experiment were performed using Tetrad. Most runs were 
performed with the threshold set to 95% assuming latent common causes. 

Figure 1 shows the results of the first run with all ten variables. Because 
contraceptive usage (cu) and number of children (nc) could have a cyclical 
dependency, thus violating the Markov condition, the experiment was repeated after 
removing nc. The new model [Figure 2] indicates that the wife's education (we),
wife's religion (wr) and wife's work (ww) are all potential causes of
the wife's age (wa). As none of these factors in the model could plausibly affect the
wife's age, a third run [Figure 3] used temporal knowledge to forbid direct causal 
links to the wife's age. This model indicates that ww, we and wr share hidden 
common causes with wa. Note that these changes do not contradict Figure 2; they 
only make it more specific since potential cause indicates either genuine cause or 
hidden common cause. The model did not seem to be sensitive to changes in 
confidence levels. 

The final results produced many interesting relationships. The Tetrad algorithm 
claims a woman's level of education (we) is a genuine causal factor for her exposure 
to media (me), her level of contraceptive usage (cu), and her husband's level of
education (he). The first two findings would be expected; the fact that a woman's
education level is a cause of her husband's education level is less intuitive. However, 
in some societies, a woman plausibly might evaluate a potential husband by his 
education level relative to her own. The FCI algorithm also indicates the existence of 
latent variables. These might be husband's age, husband's social class, family wealth, 
family social class, etc. 

Hidden causes of a woman's age are more difficult to explain, since age is 
determined temporally before most variables. However, a husband's age may 
influence his wife's age, standard of living, husband's occupation and contraceptive 
usage if societal rules encourage older, successful men to marry younger women. 



4 Comparison of Data Mining Effectiveness 

Knowledge discovery in databases has been described as the "nontrivial process of 
identifying novel, potentially useful, valid and ultimately understandable patterns in 
data" [2]. Because the capability of a data mining algorithm can be judged by these 
criteria, they will be used to compare association rule induction with causal inference.

Fig 3: Inferred graph with temporal knowledge [graph not reproduced]

Legend

A — B    Association between A and B; either A causes B, B causes A, or A and B
         have a hidden common cause.
A o→ B   A is a potential cause of B; i.e., either A causes B or A and B have a
         hidden common cause.
A ↔ B    A and B have a hidden common cause.
A → B    A is a genuine cause of B.

Variables

ww    wife works
ho    husband's occupation
wa    wife's age
we    wife's education level
sol   standard of living
he    husband's education level
wr    wife's religion
nc    number of children
me    media exposure
cu    contraceptive usage



Novelty and usefulness are closely related. Novelty compares the degree to which 
each of these algorithms provides new information. Potential usefulness examines the 
potential value that can be gained from applying the knowledge produced by each 
algorithm. Validity will be compared based on the reliability of the models produced 
by each of these algorithms. Finally, understandability will be analyzed based on ease 
of interpretation of the results.

4.1 Novelty 

Novelty can be compared using the model of knowledge and meta-knowledge gained 
described in [11]. In this model, knowledge is defined as information stored in one's
data or derived from one's data. Meta-knowledge is one's understanding of that data.
One may "know" certain things and "not know" other things. At the meta-level, one 
may "know that you know" (YKYK), "not know that you know" (DKYK), "not know 
that you don't know" (DKDK) and "know that you don't know" (YKDK). The goal of 
knowledge discovery is to move knowledge into the YKYK category [11]. New 
information is gained if either the level of knowledge or meta-knowledge is increased. 
An extended version of this model can be used to compare the level of novel 
information produced by traditional association rules with models generated by causal 
inference algorithms [Figure 4] . 



[Diagram not reproduced. Axes: Knowledge Level and Meta-Knowledge. Quadrants:
DKYK, YKYK, DKDK, YKDK. Labels recoverable from the diagram: BN Genuine Causes,
BN Hidden Causes, BN Potential Causes, BN Association, Traditional Association
Rules, Using Bayes Networks for Estimation.]

Figure 4: Knowledge/Meta-knowledge Diagram



Most of the capability of both data mining techniques involves moving knowledge 
from the DKYK category to the YKYK category [Figure 4]. Association rule 
induction creates logical rules with a head and body such as 



BODY                              HEAD                        SUPPORT (%)   CONFIDENCE (%)

WifeAge(x) = '20.00~30.00'   ==>  MediaExposure(x) = '0'      36.796        95.088



This provides some new information, i.e., it suggests a possible direction of
causation between WifeAge (wa) and MediaExposure (me), but there is no basis to
conclude that the arrow indicates direct or even indirect causation. The FCI algorithm,
on the other hand, clearly shows that the relationship between wife's age and media 
exposure is mediated by the wife's education. Thus, where association rule generation 
techniques find surface associations, causal inference algorithms identify the structure 
underlying such associations. 

Each type of relationship generated by the FCI algorithm provides additional 
information. Recall that the FCI algorithm finds four kinds of relationships, each of 
which deepens the users’ understanding of their data by constraining the possible 
models. For example, wa ↔ wr provides more information than wr o→ wa because
the latter indicates that either wr causes wa or that wa and wr have a latent common
cause, whereas wa ↔ wr specifically indicates a hidden common cause. A genuine
causal relation such as we → cu provides useful information because it indicates the
relationship from we to cu is strictly causal.

Most fascinating perhaps is the ability of causal inference algorithms to detect
hidden causes; this affects meta-knowledge in the DKYK quadrant [Figure 4]. The 
user may discover entirely unexpected latent common causes of two variables. While 
the causal inference algorithms do not and cannot indicate what the latent common
cause is, such a finding directs the experimenter to find some (unmeasured) variable 
to explain an association between two variables. 

Finally, if X is causally sufficient, then one of the inferred, Markov equivalent 
models can be chosen to represent the data as a Bayes network and initial conditional 
probabilities can be estimated. The Bayes network could then be used to make 
predictions about new objects, or to predict missing values in specific instances in the 
database. This provides some ability to move knowledge from the YKDK quadrant 
to the YKYK quadrant [Figure 4]. Unfortunately, in cases such as the CMC study the 
data is not causally sufficient. 

4.2 Understandability 

Traditional association rules provide simple logical rules about associations between 
variables. Also, their associated measures of support and confidence are easy to 
understand. On the other hand, the technique produces massive numbers of rules that 
then must be sifted through to find which are both relevant and useful, a time- 
consuming and difficult task. Simply put: "you can't see the forest for the trees". 

Bayesian networks induced by causal inference algorithms provide a deeper 
understanding of the underlying structure of relationships between variables. A quick 
glance at the Bayesian network for the contraceptive study reveals the structure of 
these relationships. On the other hand, the formal definitions of spurious association, 
potential and genuine causality [6,7] may tax the casual reader or require a leap of 
faith as to their meaning. 

If the graph is not causally sufficient, as is the case with the CMC study, then 
Tetrad produces no estimates of conditional probabilities. Therefore, while one can 
see that the wife's education is a genuine cause of contraceptive usage, it is not 
apparent whether a high education causes more or less contraceptive usage. In this 
case the association rules can provide more granular information about the 
relationship between variables. (On the other hand, these conditional probabilities are 
not hard to compute from the original data.) 
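
As the parenthetical remark suggests, such conditional probabilities can be estimated from the records by simple counting. The following Python is a minimal sketch under that reading. The function name, the record layout and the toy values are assumptions for illustration, and the toy data do not reproduce the CMC study.

```python
from collections import Counter, defaultdict


def conditional_distribution(records, child, parent):
    """Estimate P(child | parent) by relative frequencies."""
    joint = Counter((r[parent], r[child]) for r in records)
    parent_totals = Counter(r[parent] for r in records)
    table = defaultdict(dict)
    for (p, c), n in joint.items():
        table[p][c] = n / parent_totals[p]
    return dict(table)


# Hypothetical toy records: wife's education (we) and contraceptive usage (cu).
records = (
    [{"we": "4", "cu": "2"}] * 3
    + [{"we": "4", "cu": "1"}]
    + [{"we": "1", "cu": "1"}] * 2
)

for we_value, row in conditional_distribution(records, child="cu", parent="we").items():
    print(f"we = {we_value}: {row}")
```

A table of this kind shows the direction and size of the dependence, which the causal graph alone does not.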

4.3 Usefulness 

Attribute-oriented induction of association rules is primarily useful for exploratory 
data mining. The rules produced provide information that can be used to predict the 
value of other variables. For example: 



BODY                               HEAD                           SUPPORT (%)   CONFIDENCE (%)

WifeEducation(x) = '4' AND    ==>  HusbandEducation(x) = '4'      25.39         96.891
Standard of Living(x) = '4'

From this rule, if you know that the wife's education level is 4 and her standard of
living is 4, then 96.891% of the time the husband's education level will equal 4. You 
also know that this rule occurs in 25.39% of all instances in the database so the rule is 
not likely a statistical anomaly. However, the rule does not indicate the structure of 
the relationship underlying that rule. The causal graph produced by the FCI algorithm 
indicates that the wife's education has a direct effect on the husband's education but 
the husband's education and standard of living have an unknown common cause. 
Therefore, the association rule suggests that manipulating the standard of living could 
affect the husband's level of education. However, the causal graph, provided it is 
correct, suggests the standard of living would not affect the husband's education level 
but that some unknown third variable (parentage?) could affect both the husband's 
level of education and standard of living. 

If the graph is not causally sufficient, it is primarily useful for determining where 
hidden common causes and genuine causes exist. For social policy planning, it is 
important to know that a woman's level of education directly affects contraceptive 
usage. Therefore, policy makers know that by changing the level of a woman's 
education, they can affect contraceptive usage. (The FCI algorithm does not in itself 
quantify that causal relationship; however, once causal directions are specified, 
conditional probabilities are easily estimated.) 

4.4 Validity 

Little can be said about the validity of association rules. They neither assume nor 
impose real restrictions on the data. As well, the measures of confidence and support 
are simple thresholds for evaluating a rule and provide a crude "first cut". The task of 
determining whether or not rules are meaningful is left to the analyst. 

Causal inference algorithms place substantially more constraints on the data used
and make stronger claims regarding validity of the model produced. The significance
level for these algorithms is used to find partial correlations between variables as part 
of the PC or FCI algorithm. However, this value does not translate to a statistical 
measure of confidence for the model produced. 

There are several ways to test a graph's validity. For example, if altering the 
significance level up or down significantly changes the graph's topology, it is less 
dependable. (In the case of the CMC data, such alterations produced few changes.) 
One can also try to determine the probability of certain types of errors, for example, 
finding spurious edges (edge commission) or arrows (arrow commission), or not 
finding "real" edges (edge omission) or arrows (arrow omission) [8]. 
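
The edge-level error types can be illustrated by comparing the adjacencies of a reference graph with those of an inferred graph. The Python below is a sketch under the assumption that edges are compared without regard to orientation. The function name and the toy graphs are hypothetical, and arrow commission and omission would need an analogous comparison of edge endpoints.

```python
def edge_errors(true_edges, inferred_edges):
    """Compare two graphs by adjacency, ignoring edge orientation.

    Returns the spurious edges (edge commission) and the missed edges
    (edge omission) as sets of unordered variable pairs.
    """
    true_adj = {frozenset(e) for e in true_edges}
    inferred_adj = {frozenset(e) for e in inferred_edges}
    return {
        "commission": inferred_adj - true_adj,
        "omission": true_adj - inferred_adj,
    }


# Hypothetical graphs over variables A, B, C, D.
true_edges = [("A", "B"), ("B", "C"), ("C", "D")]
inferred_edges = [("B", "A"), ("B", "C"), ("A", "D")]

errors = edge_errors(true_edges, inferred_edges)
print("edge commission:", [sorted(e) for e in errors["commission"]])
print("edge omission:  ", [sorted(e) for e in errors["omission"]])
```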

If the graph is causally sufficient and conditional probabilities are estimated for 
one possible model, then Monte Carlo simulations can be used to generate data sets 
that represent the model's structure. By applying the PC algorithm to such data sets 
and comparing the results with the original graph, probabilities for each of the 
preceding error types can be calculated. 

There are other general findings on factors that affect validity of models produced
by causal inference algorithms [8]. Discrete variables with few possible values are
more likely to result in spurious correlations. Therefore, it is better if discrete
variables have at least three outcomes. Numerous Monte Carlo simulations were 
performed on regressive structured equation models with continuous variables. This 
experimentation examined how sample size and average degree of a vertex (average 
number of nodes adjacent to the vertex) affected reliability [8]. The most common 
error tended to be arrow commission, but low sample sizes and high average vertex 
degrees also adversely affected reliability. There tended to be few edge commission 
and omission errors. Also, arrow commission and omission rates decreased 
dramatically when sample sizes were greater than 2000. Our experiments confirmed 
that these trends also applied to causal inference on discrete data. 

Lastly, there has been a strong debate in the academic community over the validity 
of inferring causality [3,5,10]. Freedman and Humphreys [3,5] have criticized the 
causal inference algorithms both on the underlying principles and their effectiveness 
on real data. However, they concede that "The mathematical development has some 
technical interest and the algorithms could find a limited role as heuristic devices for 
empirical workers." The issue of whether or not such algorithms can perfectly 
discover causal structure may not impact their usefulness as exploratory data mining 
tools. If association rules have been found to provide a useful first cut of
exploratory information, causal inference algorithms can certainly provide an
interesting second cut.



5 Conclusions 

Causal inference algorithms produce fewer and more concise relationships between 
variables than association rules. They reveal underlying causal structure among 
variables, not just apparent surface associations. The ability to reveal the existence of 
hidden common causes outside of known data is a promising tool. With causally 
sufficient data, conditional probabilities can be generated that allow calculation of 
causal effects of a change on one variable on other variables in the set. This provides 
much more useful functionality than association rules. In addition, the causal structure
revealed by the Bayes networks combined with the minimal set of conditional 
probabilities is clearer than numerous association rules. Finally, the more rigid 
statistical basis for causal inference leaves less to the user to determine which 
discovered relationships are valid and which are not. 

A minor concern with Tetrad and causal inference algorithms is that Tetrad does
not generate conditional probabilities unless the variables are causally sufficient.
Therefore, the user will know the causal structure but will not have any of the
quantitative details of that relationship. Providing the user with conditional
probabilities for causally insufficient graphs would be helpful. (While these
probabilities could not be used to make estimates of probabilistic causal effect*, the
user would better understand quantitative relationships in the graph.) That said, this
information would not be difficult to add.

* Probabilistic causal effect of B on A is the probability of some outcome of A given that B is
set to some value, as opposed to the probability of A given B is observed to have some value.
This is easy to compute from a causally sufficient graph [?].
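
The difference between setting B and observing B can be made concrete with a small worked example. The Python below assumes a causally sufficient graph with three binary variables, C → B, C → A and B → A, and invented probabilities. It uses one standard way of computing the effect, namely averaging P(A | B, C) over the marginal distribution of C rather than over P(C | B). Neither the graph nor the numbers come from the CMC study.

```python
# Assumed model: C -> B, C -> A, B -> A, all variables binary and measured.
p_c = {0: 0.5, 1: 0.5}                      # P(C = c)
p_b1_given_c = {0: 0.2, 1: 0.8}             # P(B = 1 | C = c)
p_a1_given_bc = {                           # P(A = 1 | B = b, C = c)
    (0, 0): 0.2, (0, 1): 0.6,
    (1, 0): 0.5, (1, 1): 0.9,
}


def p_b_given_c(b, c):
    return p_b1_given_c[c] if b == 1 else 1 - p_b1_given_c[c]


def observed(b):
    """P(A = 1 | B = b observed): C is weighted by P(C | B = b)."""
    p_b = sum(p_b_given_c(b, c) * p_c[c] for c in p_c)
    joint = sum(p_a1_given_bc[(b, c)] * p_b_given_c(b, c) * p_c[c] for c in p_c)
    return joint / p_b


def intervened(b):
    """P(A = 1 | B set to b): C keeps its marginal distribution P(C)."""
    return sum(p_a1_given_bc[(b, c)] * p_c[c] for c in p_c)


print(f"P(A=1 | B=1 observed) = {observed(1):.2f}")
print(f"P(A=1 | B set to 1)   = {intervened(1):.2f}")
```

With these numbers the observed conditional probability is 0.82 while the probability under intervention is 0.70, because observing B = 1 also makes C = 1 more likely.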

While the results of causal inference of Bayes networks are more understandable 
than association rules, the underlying algorithms and theory are more complicated. 

High arrow commission and omission rates raise concerns about relying on the 
direction of causality, but this is tempered by the relatively infrequent edge omission 
and commission errors. Likewise, the placement of a variable in the "head" or "body" of an
association rule is not a reliable indicator of causal direction.

Abstract. Described here is a first attempt to classify citations according
to function in a fully automatic manner, that is, complete journal
articles in electronic form are input to the citation classifier and a set
of citations with their suggested function (chosen from a previously proposed
scheme of functions) is output. The description consists of a brief
introduction to the classification scheme, a description of the classifier,
and a summary of the results of a test of the classifier on real data.



1 Introduction 

One method to enable efficient retrieval of documents is to index them in a citation
index. A citation index consists of source items and their corresponding lists
of bibliographic descriptions of citing works. A citation connecting the source
document and a citing document serves one of many functions. For example, one
function is that the citing work gives some form of credit to the work reported in
the source article. Another function is to criticize previous work. When using a
citation index, a user normally has a more precise query than “Find all articles
citing a source article”. Rather, the user may want to know whether other
experiments have used similar experimental techniques to those used in the source
article, or whether other works have reported conflicting experimental results.

In order to use a citation index in this more sophisticated manner, the citation 
index must contain not only the citation link information but also must indicate 
the function of the citation in the citing article. The function of the citation 
must be determined using information derived from local and global cues in the 
citing article. 

We describe here a first attempt to classify citations according to function 
in a fully automatic manner, that is, complete journal articles in electronic form 
are input to the citation classifier and a set of citations with their suggested 
function (chosen from a previously proposed scheme of functions) is output. This 
description consists of a short introduction to the classification scheme (which
is described in detail in [7]), a description of the classifier, and a summary of the
results of a test of the classifier on real data.

2 The Problem 

A citation in the citing document is a portion of a sentence which references
another document or a set of other documents collectively. For example, in the
sentence given below, there are two citations. The first citation is “Although
. . . progress” and is anchored by the “(Eger et al., 1994; Kelly, 1994)” reference
set. The second citation is “it was . . . submasses” which is anchored by
“(Goughian et al., 1986)”. Sometimes the anchor reference is part of the citation itself
(e.g. “Smith (1996) also proposes . . . ”).

Although the 3-D structure analysis by x-ray crystallography is still in
progress (Eger et al., 1994; Kelly, 1994), it was shown by electron
microscopy that XO consists of three submasses (Goughian et al., 1986).

H. Hamilton and Q. Yang (Eds.): Canadian AI2000, LNAI 1822, pp. 337 484(31 2000.
(c) Springer-Verlag Berlin Heidelberg 2000
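
The parenthetical reference sets that anchor the two citations in the example sentence can be located with a simple pattern. The Python below is only a sketch: the regular expression and the function name are assumptions, it handles author-year reference sets in parentheses, and it does not determine citation boundaries, which this paper does not describe.

```python
import re

# A parenthesised group that ends in a four-digit year (optionally "1994a").
ANCHOR = re.compile(r"\(([^()]*\d{4}[a-z]?)\)")


def reference_sets(sentence):
    """Return each parenthetical reference set as a list of references."""
    return [
        [ref.strip() for ref in group.split(";")]
        for group in ANCHOR.findall(sentence)
    ]


sentence = (
    "Although the 3-D structure analysis by x-ray crystallography is still in "
    "progress (Eger et al., 1994; Kelly, 1994), it was shown by electron "
    "microscopy that XO consists of three submasses (Goughian et al., 1986)."
)

for refs in reference_sets(sentence):
    print(refs)
```

For an anchor of the form “Smith (1996) also proposes”, this pattern would capture only the year, so a fuller treatment would need a separate rule for author names outside the parentheses.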



If a user uses a citation index to retrieve documents, a large number of irrelevant
documents could be retrieved because of the many functions of a citation
(e.g. paying homage to pioneers, use of equipment, use of technique, conflicting
and supporting results, etc.). Given the standard definition for retrieval precision,

precision = (# relevant documents retrieved) / (# documents retrieved),

it seems that precision can be increased if
the documents retrieved were restricted to match the user’s requirements more
closely. For instance, restricting the documents retrieved to a user’s need to find
other articles that report using the same experimental technique as the source
document could be achieved if the function of the citation, “Use of technique”,
were used in the retrieval request. The ability to restrict the retrieval in this
manner would require that the function of the citation be added to the index.
In order to accomplish the addition of the citation function to the index it is
first necessary to classify citation function. Given the amount of data, automating
this procedure is required. Prior to classification, a classification scheme is
needed. The differentiation power of the classification scheme will be reflected
in the precision of the search.
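
The effect of adding citation function to the index can be sketched with a toy example. In the Python below, the index entries, the document identifiers and the set of relevant documents are invented for illustration. The function labels are borrowed from the examples of citation function mentioned above and are not the category names of the classification scheme.

```python
def precision(retrieved, relevant):
    """# relevant documents retrieved / # documents retrieved."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def retrieve(index, function=None):
    """Citing documents in the index, optionally restricted to one function."""
    return {doc for doc, fn in index if function is None or fn == function}


# Hypothetical citation index for one source article: (citing document, function).
citation_index = [
    ("doc1", "Paying homage to pioneers"),
    ("doc2", "Use of technique"),
    ("doc3", "Conflicting results"),
    ("doc4", "Use of technique"),
]

# The user wants articles that used the same experimental technique.
relevant = {"doc2", "doc4"}

print(precision(retrieve(citation_index), relevant))
print(precision(retrieve(citation_index, "Use of technique"), relevant))
```

In this toy index the unrestricted query retrieves all four citing documents, of which two are relevant, whereas the query restricted by function retrieves only the two relevant ones.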



This, then, motivates our problem: develop a citation classification scheme
for automatic classification, which is more comprehensive than any previously
proposed, and implement an automated citation classifier based on this enhanced
scheme. In implementing a fully automated citation classification system prototype,
we have obtained some early results that would indicate that the solution
that we discuss in more detail below holds enough promise to warrant further
investigation. Several citation classification schemes have been amalgamated into
a comprehensive citation classification scheme containing 35 categories. A pragmatic
grammar, consisting of 195 lexical matching rules and 14 parsing rules,
has been developed to classify citations based upon a citation’s cue words and
location in the article. An automated citation classifier which classifies the citations
in a biochemistry or physics article has been built. The performance of the
automated citation classifier has been measured on previously seen and previously
unseen articles: it is good on the previously seen set of articles and fair on a
previously unseen set of 6 articles.

3 Background

Classification of citation contexts includes the use of cue words (for instance, 
“still in progress” in the citation example above indicates that the citation is 
referring to work of a concurrent nature) and verb tenses (for instance, “was 
shown” indicates that a key result is discussed in this previous work). The
surrounding context (the preceding and following sentences) as well as the location
of the citation in the citing paper also play crucial roles in determining the 
function of a citation. 

There have been a number of citation classification schemes proposed in the
citation literature. Garfield [6] was the first to define a classification scheme,
albeit without that purpose in mind. For our purposes, Finney [4] was the first
to suggest that a citation classifier could be automated, but her system falls
short because of the small number of citation categories.¹ We are unaware of an
implementation of her system. We have, nonetheless, expanded on her original
ideas (associating cue words with citation function and using citation location
in the classification algorithm) to produce an automated citation classifier.

A number of schemes (including the two previously mentioned) have been
proposed: Cole [2] (9 categories); Duncan, Anderson, and McAleese [3] (26
categories); Finney [4] (7 categories); Frost [5] (15 categories); Garfield [6]
(15 categories); Lipetz [8] (9 categories); Moravcsik and Murugesan [9] (4
dimensions); Peritz [10] (9 categories); Small [11] (5 categories); Spiegel-Rosing
[12] (13 categories); and Weinstock [14] (15 categories). We have proposed a
citation classification scheme that is more comprehensive than the union of all 
of the previous schemes. This scheme is discussed in detail in [7]. However, in 
order to provide some reference and to show the breadth and granularity of the 
classification scheme, a listing of the 35 categories is given in the Appendix. 

4 System Overview 

In summary we have implemented a citation classifier system that aims to be a 
100% automated prototype. Journal articles (currently, only biochemistry and 
physics) are input to the classifier system and for each citation in the article, one 
of the functions from the classification scheme (see the Appendix) is assigned. 

All articles used in this study are in electronic form. Those articles which are 
in postscript were translated using pstotext. Formula translation requires some 
manual intervention, but this is the only part of the prototype system that is not 
fully automated. The classifier requires some simple syntactic information: NPs,
head nouns, PPs, verbs, etc. To derive this information, we used Strzalkowski’s
Tagged Text Parser (TTP) [13] which requires the use of a tagger, in our case, the
Brill tagger [1]. The output of the parse could contain false information (such
as that caused by ambiguities) and the TTP can give partial parses (time out,
parenthetical remarks not being handled correctly, and an impoverished lexicon).
Later components in the classifier system allow for this noise. Also, some errors
in classifying are caused by the noise.

¹ The quality of a search which is aided by being able to request citation function is
directly related to the comprehensiveness (breadth and granularity) of the citation
classification scheme.

(Category 24/Results Cue Words) := postulated | reads | reported

Fig. 1. Example of cue words. These cue words, if found in the “Results” section
of the journal article, would indicate that the citation should be classified as
category 24

Only the sentences containing the citations were extracted from the text
for further analysis.² One of the outcomes of the experiment was a recognition
of the important role that the preceding and following sentence could play in
classification. However, the inclusion of these sentences as an information source
used by the classifier is left for future work. The details of how the boundaries of
a citation are determined are beyond the scope of this paper. But it is important
to realize that the citation is part of the sentence that the citation occurs in and
is anchored by a reference or reference set.

For the purposes of this paper, the most important part of the classifier 
system is what we have called the pragmatic parser. The knowledge used by the 
parser is contained in a pragmatic grammar. This grammar has been developed 
by manually extracting and studying citations from 14 journal articles (8 physics 
and 6 biochemistry) that we have called the design set. The grammar rules reflect 
this purpose. The rules are of two types: lexical rules based on cue words which 
have been associated with functional properties; and grammar-like rules which 
allow more sophisticated patterns to be associated with functional properties. 

We provide an example of cue words and a grammar rule in Figures 1 and 2. 
Figure 1 provides the list of cue words which, if found in the “Results” section 
of the journal article, would indicate that the citation should be classified as 
category 24 (Used for developing new hypothesis or model). 
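
The lexical rule of Fig. 1 can be read as a lookup keyed on the section and the cue words. The following Python fragment is a minimal sketch under that reading; the function name, the dictionary layout, and the whitespace tokenization are illustrative assumptions and not the implemented classifier, which works on parsed sentences.

```python
# Illustrative sketch only: the single rule mirrors Fig. 1; all names are hypothetical.
CUE_WORD_RULES = {
    # (section, category) -> cue words that suggest the category
    ("results", 24): {"postulated", "reads", "reported"},
}


def suggest_categories(sentence, section):
    """Return the sorted categories whose cue words occur in the citation sentence."""
    # Crude tokenization (assumption): split on whitespace and strip punctuation.
    tokens = {tok.strip(".,;:()[]\"'").lower() for tok in sentence.split()}
    return sorted(
        category
        for (rule_section, category), cues in CUE_WORD_RULES.items()
        if rule_section == section.lower() and tokens & cues
    )
```

A cue word only counts when the citation sentence lies in the section named by the rule, so the same sentence found in a “Methods” section would produce no suggestion from this rule.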

Figure 2 shows a grammar rule, (par-1), together with other supporting grammar 
rules in Extended BNF that would indicate one of the “use of A” categories 
18, 19, 20, 21, 22, and 27 (see the Appendix for details) if it matched the cita- 



^ The section in which the citation is found is also recorded. The sections for these two 
types of scientific journals are: introduction, methods, results, and discussion. Some- 
times, the last two sections are combined. Also, the physics articles tend to be less 
stratified than the biochemistry, leading to poorer automatic section categorization 
for the physics articles. 

® Our choice of the term “pragmatic grammar” (and hence “pragmatic parser”) has 
been motivated by the existence of semantic grammars where specialized lexical 
categories are based on their semantic properties. Some constituent categories have 
been motivated by the function of the constituent in this particular domain of citation 
classification in scientific journals. The purpose of the pragmatic grammar is to 
suggest the function of a citation. 
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tion. The category is determined by which head category is matched. (The rule 
presented here is a slight simplification of the rule that has been implemented. 
The actual rule also allows for “not” but this complicates the presentation and 
this aspect has been left out of the current discussion.) 



(par-1) := (usage-verb-1) (head-modifier) (head-1) 

(head-1) := (equipment-head=18) | (equation-head=19) | (method-head=20) 
    | (conditions-head=21) | (analysis-head=22) | (definition-head=27) 
    | (numeric-data-head=28) 

(usage-verb-1) := use | using | uses | introduce | introducing | introduces | apply 
    | applying | applies | obtain | obtaining | obtains | follow | following | follows 
    | introduce | introducing | introduces | consider | considering | considers 

(head-modifier) := any part of speech which modifies the head noun, e.g. 
    determiners, adjectives, prepositional phrases, etc. 

(equipment-head) := adaptin | albumin | apparatus | applications | arrays 
    | atcc | bcf | body | borohydride | brains | buses | calorimeter | cell | cells 
    | chymotrypsin | coli | complexes | concentration | concentrations | crude 
    | decarboxylase | deformation | electron | enzyme | equipment | extract | factor 
    | fluorouracil | fractions | furin | inhibitor | jer | kit | ktao | ktn | 131 
    | lambda-dash | liver | livers | medium | melt | membranes | mixture | mprs | mtx 
    | muons | ng | opti-mem | overlay | p-tefb | pattern | pbluescript | pcmv | potv 
    | peptide | phosphate | pkd201 | plant | plants | primer | primers | probe 
    | protein | rabbit | radiation | ray-sums | reactions | samples | scintillators 
    | software | spectrometer | strain | strains | structure | substituents | substrates 
    | technology | template | thymus | tomograph | tool | ultrasound | unit | vesicles 
    | w28 | zone | zones 

(equation-head) := algorithm | components | curve | distribution | eigenvalues 
    | eq | equation | expression | expressions | field | functions | g | i | k | lemma 
    | model | models | notation | number | product | proof | relation | representation 
    | s | systems | term | terms | theorem | theorems | theory | value 

(method-head) := assays | blot | blots | change | chromatography | classification 
    | determinations | electrophoresis | electroporation | experiment | extraction 
    | extrapolation | geometry | hpai | hybridization | hydrolysis | image | integration 
    | least-squares | method | methodology | mutagenesis | operation | p-11 
    | parametrization | preincubation | principle | procedure | procedures | process 
    | processing | protocol | reaction | reduction | restoration | sds-page | separation 
    | spectrometry | spectrophotometry | synthesis | technique | techniques 
    | tomography | transfection | transformation | work | zymography 

(conditions-head) := condition | conditions | intensities | levels | light | pressure 

(analysis-head) := analyses 

(definition-head) := definition 

(numeric-data-head) := data | measurements 



Fig. 2. Example of grammar rules. If rule (par-1) matches the citation, then one 
of categories 18, 19, 20, 21, 22, and 27 is suggested 
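
To make the shape of rule (par-1) concrete, the following Python fragment sketches one possible reading: a usage verb, then a short run of modifiers, then a head noun whose lexical class gives the category. It is only an illustration under stated assumptions. The lexicon is a small hand-picked subset of Fig. 2, the bound on the number of modifiers is invented, and the matching runs over a flat token list rather than over the output of the Tagged Text Parser.

```python
# Hypothetical, heavily abbreviated lexicon taken from Fig. 2.
USAGE_VERBS = {"use", "using", "uses", "apply", "applying", "applies",
               "follow", "following", "follows"}
HEAD_CATEGORIES = {
    "apparatus": 18, "spectrometer": 18,   # (equipment-head)
    "equation": 19, "theorem": 19,         # (equation-head)
    "method": 20, "procedure": 20,         # (method-head)
    "conditions": 21,                      # (conditions-head)
    "analyses": 22,                        # (analysis-head)
    "definition": 27,                      # (definition-head)
}
MAX_MODIFIERS = 4  # assumed bound on the length of (head-modifier)


def match_par_1(tokens):
    """Return the category suggested by (par-1), or None if the rule does not match."""
    words = [t.lower() for t in tokens]
    for i, word in enumerate(words):
        if word not in USAGE_VERBS:
            continue
        # Tokens between the verb and the head noun are treated as modifiers.
        for candidate in words[i + 1 : i + 2 + MAX_MODIFIERS]:
            if candidate in HEAD_CATEGORIES:
                return HEAD_CATEGORIES[candidate]
    return None
```

For instance, the tokens of "We use the standard procedure of [3]" match with head noun "procedure", which suggests category 20.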
Close scrutiny of the rules reveals that some obvious linguistic elements could 
have been added, e.g. some simple morphological analysis to accept plurals. 
These sorts of techniques would add to the robustness of the system. Their 
addition has been left for future work. 



5 Results of an Evaluation 

The citation classifier was tested on two sets of articles: the design set which is 
composed of the 8 physics and 6 biochemistry articles mentioned in the previous 
section and a test set, 3 physics and 3 biochemistry articles which have been 
previously unseen. The classifier produced the citation function for each cita- 
tion and this was compared to the function produced by a human classifier. The 
results were tabulated considering the different types of outcomes. Table 1 is a 
representative example of the correctness of the function assigned by the classifier 
to citations extracted from previously unseen biochemistry articles. It 
presents a variety of “types of correctness/incorrectness” of the citation analysis 
output. The various correctness types are outlined below. 

Two overriding aspects are important in the interpretation of the categories. 
First, the human classifier may decide that a citation can be in more than one 
category. Thus, the citation is assigned a set of categories. The correctness of 
the citation assignment set is classified based on the correctness of each citation 
assignment in the set as follows. A citation assignment set is classified as being 
completely right if all of its citation assignments are correct or extra correct. A 
citation assignment set is classified as being partially right if at least one of its 
citation assignments is correct and at least one is incorrect, extra incorrect, or 
missing. A citation assignment set is classified as being completely wrong if all of 
its citation assignments are either incorrect, missed, or extra incorrect. Second, 
when the pragmatic parser does not have any rules to make a classification for 
a citation, the most frequently occurring category for the section in which the 
citation is found is assigned as the default classification. 
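
The three-way grading of a citation assignment set described above can be sketched as follows. The outcome labels and the function name are illustrative assumptions. Treating “Default Correct” like “Correct” and “Default Incorrect” like “Incorrect” follows the grouping in Table 1.

```python
# Outcome labels for the individual assignments in a set (hypothetical encoding).
CORRECT = {"correct", "default correct"}
RIGHT = CORRECT | {"extra correct"}
WRONG = {"incorrect", "default incorrect", "extra incorrect", "missing"}


def grade_assignment_set(outcomes):
    """Grade a citation assignment set from the outcomes of its assignments."""
    outcomes = [o.lower() for o in outcomes]
    if not outcomes or any(o not in RIGHT | WRONG for o in outcomes):
        raise ValueError("empty set or unknown outcome label")
    if all(o in RIGHT for o in outcomes):
        return "completely right"
    if all(o in WRONG for o in outcomes):
        return "completely wrong"
    if any(o in CORRECT for o in outcomes):
        return "partially right"
    # An extra correct assignment without a correct one is not covered by the text.
    raise ValueError("combination not covered by the grading rules")
```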

A “Correct” assignment is one in which a correct assignment is made by 
only one pragmatic grammar rule. A “Default Correct” citation assignment is 
a correctly assigned default category. A “Correct x 2” citation assignment is 
made when both members of the citation assignment set match the two members 
of the human classifier assigned categories. A “Correct + Default Correct” 
indicates that one of the human classifier categories is matched by a pragmatic 
grammar rule and the other is matched by a default match. (An attempt is made 
to match each member of the human classifier set. So, it is possible to have one 
member matched with a rule and one matched with a default assignment.) A 
“Correct + Extra Correct” citation assignment is made when more than one 
rule matches a citation, repeating an already assigned category. A “Correct x 
2 + Extra Correct” is just the obvious combination of previous categories. A 
“Correct + Missing” classification by the automated classifier means that one of 
the human classifications is matched by the automated classifier and one cannot 
be matched by default (the section value is not available). A “Correct + Extra 
Table 1. Representative classification results by correctness type for unseen 
paper. 

Correctness of      Correctness type                          Number   Percent 
classification                                                found    of total 

Completely Right    Correct                                     19       29.2 
                    Default Correct                              9       13.8 
                    Correct x 2                                  1        1.5 
                    Correct + Default Correct                    1        1.5 
                    Correct + Extra Correct                      3        4.6 
                    Correct x 2 + Extra Correct                  1        1.5 

Partially Right     Correct + Missing                            4        6.1 
                    Correct + Extra Incorrect                    3        4.6 
                    Correct + Incorrect                          3        4.6 
                    Correct + Extra Incorrect + Incorrect        1        1.5 

Completely Wrong    Incorrect                                   12       18.5 
                    Default Incorrect                            5        7.7 
                    Default Incorrect x 2                        1        1.5 
                    Missing                                      1        1.5 

Incorrect” means that the human classification is matched by one pragmatic 
rule, but another pragmatic rule generates an incorrect suggestion. A “Correct 
+ Incorrect” means that one of the categories is correctly matched and the other 
one is not given a correct match. “Correct + Extra Incorrect + Incorrect” is the 
obvious combination of previous categorizations. An “Incorrect” citation assign- 
ment means that the automatic classifier gives an incorrect non-default result. A 
“Default Incorrect” citation assignment is an incorrectly assigned default cate- 
gory. “Default Incorrect x 2” means that both categories are not matched by the 
default suggestion. “Missing” means that no pragmatic rule is applicable and 
the section value is not available, so no assignment is made. 

On average, the percentages of citation assignment sets classified as completely 
right, partially right, and completely wrong are 78%, 11%, and 11%, respectively, 
for the design set physics articles, and 84%, 8%, and 8% for the design set 
biochemistry articles. For the test set, the corresponding averages are 41%, 21%, 
and 38% for the physics articles, and 61%, 12%, and 27% for the biochemistry 
articles. 

The classifier performed worse on the unseen physics articles because their less 
well-defined structure meant that the section (important information for the 
classifier) was determined correctly less often than for the structurally more 
well-defined biochemistry articles. The determination of section (which is done 
automatically by the classifier) is one aspect which needs improvement. 

Part of the reason for the (anticipated) poorer results on the unseen articles 
than on the design set articles can be directly attributed to the impoverished 
lexicon. The lexicon used for the test on the previously unseen articles was 
developed solely from the text found in the design set citations. Our objective 
was a proof of concept, and we believe that we have succeeded with this objective. 
The next step would be to monitor the improvement in the classification given 
incremental improvements to elements such as the lexicon. 



6 Conclusions and Future Work 

Several citation classification schemes have been amalgamated into a more com- 
prehensive citation classification scheme containing 35 categories. A pragmatic 
grammar, consisting of 195 lexical matching rules and 14 parsing rules, has been 
developed to classify citations based upon a citation’s cue words and section 
location. An automated citation classifier which classifies the citations in a bio- 
chemistry or physics article has been built and tested on real journal articles. 
Good performance of the automated citation classifier on the design set of ar- 
ticles was achieved. Fair performance of the automated citation classifier on a 
previously unseen set of 6 articles was obtained. 

Future work includes increasing the precision of the classifier by improving 
the knowledge that it relies on. This includes increasing the scope of the lexicon 
and the grammar rules (appropriate machine learning techniques could help to 
augment both), improving the section determination algorithm (a study of style 
could provide some important knowledge), and investigating the role of verb 
tense in the pragmatic grammar. It would be most advantageous to extend the 
scope of the classifier to other scientific and non-scientific fields. The ultimate 
aim is to incorporate the classifier into a full citation system. 
A List of Citation Categories 

Negational Type Categories 

1. Citing work totally disputes some aspect of cited work. 

2. Citing work partially disputes some aspect of cited work. 

3. Citing work is totally not supported by cited work. 

4. Citing work is partially not supported by cited work. 

5. Citing work disputes priority claims. 

6. Citing work corrects cited work. 

7. Citing work questions cited work. 

Affirmational Type Categories 

8. Citing work totally confirms cited work. 

9. Citing work partially confirms cited work. 

10. Citing work is totally supported by cited work. 

11. Citing work is partially supported by cited work. 

12. Citing work is illustrated or clarified by cited work. 

Assumptive Type Citations 

13. Citing work refers to assumed knowledge which is general background. 

14. Citing work refers to assumed knowledge which is specific background. 

15. Citing work refers to assumed knowledge in an historical account. 

16. Citing work acknowledges cited work pioneers. 

Tentative Type Categories 

17. Citing work refers to tentative knowledge. 
Methodological Type Categories 

18. Use of materials, equipment, or tools. 

19. Use of theoretical equation. 

20. Use of methods, procedures, and design to generate results. 

21. Use of conditions and precautions to obtain valid results. 

22. Use of analysis method on results. 

Interpretational/Developmental Type Categories 

23. Used for interpreting results. 

24. Used for developing new hypothesis or model. 

25. Used for extending an existing hypothesis or model. 

Future Research Type Categories 

26. Used in making suggestions of future research. 

Use of Conceptual Material Type Categories 

27. Use of definition. 

28. Use of numerical data. 

Contrastive Type Categories 

29. Citing work contrasts between the current work and other work. 

30. Citing work contrasts other works with each other. 

Reader Alert Type Categories 

31. Citing work makes a perfunctory reference to cited work. 

32. Citing work points out cited works as bibliographic leads. 

33. Citing work identifies eponymic concept or term of cited work. 

34. Citing work refers to more complete descriptions of data or raw sources of 
data. 
Abstract. We consider the problem of learning classifiers with a small 
labeled example set and a large unlabeled example set. This situation 
arises in many applications, e.g., identifying medical images, webpages, 
and sensing data, where it is hard and expensive to label the examples 
while it is much easier to acquire unlabeled examples. We suppose 
that the training data is distributed according to a mixture model with 
Gaussian components. An approach to selecting typical examples for learning 
classifiers is proposed, and the typicality measure is defined with respect 
to the labeled data according to the Mahalanobis squared distance. The 
algorithm for selecting typical examples is described. The basic idea is 
that a training example is randomly drawn, and its typicality is measured. 
If the typicality is greater than the threshold, then the training 
example is sampled. The number of typical examples sampled is limited 
by the memory capacity. 

Keywords: Data reduction, classification, machine learning. 



1 Introduction 

Many approaches to data reduction have been proposed and implemented for 
machine learning and data mining. Basically, two categories of algorithms have 
been investigated and developed extensively. One approach focuses on the re- 
duction of features which are used to represent the original data, called feature 
selection [S]. The goal is to reduce the data dimensionality by selecting important 
attributes with little or no loss of information, so that the data can be 
expressed in a succinct form and the data size is decreased 
to the extent that the algorithms can execute efficiently in main memory. 
The other category addresses example selection, or tuple selection which selects 
or searches for a representative portion of the original data that can fulfill a 
learning task as if the whole data set is exploited. 

Training examples such as spectral bands from remote sensors, text documents 
on the World Wide Web, etc. are usually very expensive and time-consuming to 
acquire. Identifying the class labels of these training data must be done by 
consulting experienced analysts and/or by other means. Obtaining unlabeled data, 
however, is comparatively easy. For example, text documents can be searched 
by crawlers [2], and sensing data can be obtained automatically by means of 
inspection of the sensors [13]. From the point of view of statistics, both the la- 
beled and unlabeled data contain useful information for classifying the future 
data, although the labeled data contains more information than the unlabeled 
data. In the case of lack of sufficient labeled data to form and train the classifier, 
we expect to use unlabeled data to enhance the estimates of the parameters 
required and obtain a much more accurate classifier. 

The general approach to designing classifiers with both labeled and unlabeled 
data is based on the EM algorithm [2, 6, 11, 13]. The huge amount of unlabeled 
data, which may be obtained automatically and very easily, must be used to 
compute the expectation and maximization of the likelihood function again and 
again during the iteration of the EM algorithm until the algorithm terminates. 

In the case that the unlabeled data cannot be stored in main memory, the 
algorithm must access the disk again and again, which is time-consuming and 
inefficient. On the other hand, the unlabeled data is randomly acquired and may 
not be the representatives of the classes, thus may cause the EM algorithm to 
converge very slowly. This paper contributes to selecting typical examples from 
both the labeled and unlabeled data for the EM algorithm so that the classifiers 
can be efficiently formed. 

Suppose that the mixture model consists of Gaussian components. The means 
and covariances can be estimated on the basis of the labeled data. The labeled 
and unlabeled examples that are close to the mean and have small deviations 
are likely to be the typical examples of the component. 

We introduce related work in Section 2. The typicality measure of labeled 
and unlabeled data with respect to the Gaussian mixture model is investigated 
based on the Mahalanobis distance |H] in Section 3 and Section 4, respectively. 
The algorithm for selecting typical labeled and unlabeled examples is presented 
and the run time complexity is analyzed in Section 5. Finally, in Section 6 we 
present concluding remarks. 



2 Related Work 



Example selection techniques can be divided into two main classes: parametric 
techniques and non-parametric techniques [1]. The former assumes a parametric 
model for the data distribution, like Gaussian or exponential distributions, and 
then estimates the parameters of the model, while the latter does not assume 
any model for the data. The approach that we consider in this paper uses a 
parametric technique, which assumes specific kinds of distributions of the original 
data, and estimates the parameters. After the parameter estimate is obtained 
the typical data, with respect to the parametric model, can be drawn according 
to their typicality measure, which forms the representative portion of the original 
data. Actually, learning algorithms such as decision tree, neural network, and 
other classification and/or clustering algorithms [T2] can operate on the typical 
sample. 
Breiman considers the problem when the original data is too large to hold 
in memory of any currently available computer, and proposes an approach to 
solving the problem which divides the data set to many small subsets such that 
each subset can be held in the main memory, learns classifiers based on the data 
subsets, and finally combines these classifiers to form an aggregation classifier 
that has higher accuracy [2- Fukunaga and Mantock present a nonparametric 
approach to data reduction which is used to select samples that are represen- 
tatives of the entire data set. The technique is iterative and is based on the use 
of a criterion function and nearest neighbor density estimates. Kah-Kay Sung 
discusses the problem of active example selection for function approximation 
learning [14], which is related to our work. Barbara et al. survey the techniques 
of data reduction from the database perspective pp, including singular value de- 
composition, discrete wavelet transform, linear regression, histogram clustering, 
index trees, etc. None of these techniques considers the situation of learning a 
classifier with labeled and unlabeled examples. 

Another closely related work is on outlier detection [7], which attempts to 
detect outliers for cluster analysis, where outliers are atypical data for clustering 
and are viewed as noise. The difference between outlier detection and our typical 
example selection is that our method selects the more typical examples 
because of limited memory capacity and for learning efficiency, not because the 
remaining examples are outliers. 

In this paper the finite mixture model of distributions is assumed to exist in 
the original data m- Hidden in the parametric technique is the problem that 
choosing an appropriate model is an art, and the model may not always do well 
with any given data set [Ij. 

3 Typicality Measure of Labeled Examples 

It has been pointed out that convergence with the EM algorithm is generally 
quite slow [T7M] . Thus, the observed data must be iteratively used to estimate 
the parameters. In the case where the number of unlabeled data is very large as 
in most practical situations, the EM algorithm is inefficient computationally. If 
the original data is stored on disk, then the disk must be scanned in many passes 
until the termination condition is satisfied. 

In order to improve the performance of the EM algorithm, we expect that 
the data used in the EM algorithm could be stored in main memory. Sampling 
is a useful method to reduce data size. Different sampling schemes may introduce 
different biases with respect to different objectives. In our case, a random sample is not a 
good choice since the data distribution has been assumed to have a special form, 
e.g., a normal distribution. This assumption can be viewed as prior knowledge 
and could bias the estimation of the parameters. 

Given the instance set, or multivariate observations, $D$, which consists of 
two parts $D = \{X, Y\}$, where $X = \{x_1, x_2, \ldots, x_n\}$ is the unlabeled subset 
of size $n$, and $Y = \bigcup_{i=1}^{g} Y_i$, where $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{im_i}\}$, is the labeled 
subset of size $m = \sum_{i=1}^{g} m_i$; $g$ is the number of the classes to be learned. 
The examples $y_{ij}$, $j = 1, 2, \ldots, m_i$, are labeled with the $i$-th class label, which is from the label 
set $C$ of size $g$. The $x_i$, $i = 1, 2, \ldots, n$, and $y_{ij}$, $i = 1, 2, \ldots, g$, $j = 1, \ldots, m_i$, are 
$p$-dimensional vectors, and the class labels are from the label set $C = \{1, 2, \ldots, g\}$. 

Two assumptions must be made. First, $D$ is produced by a mixture model 
consisting of $g$ components and each component can be expressed as a probability 
distribution. Under the finite mixture model, each observation can be viewed 
as arising from a population $G$ which is a mixture of a finite number, say $g$, of 
subpopulations, or components, $G_1, G_2, \ldots, G_g$. Second, to exploit a probabilistic 
model to design a classifier, it is assumed that there is a one-to-one correspondence 
between the generative mixture components $G$ and the classes $C$, and each class 
distribution can be described by a certain component. Thus, for simplicity, the $g$ 
classes will be denoted as $G_1, G_2, \ldots, G_g$, respectively, in what follows. 

To measure the typicality of examples under the condition that the mixture 
components are all normal, we introduce the concept of the Mahalanobis squared 
distance on the basis of the classified data $Y$. In this section, we define the 
typicality of classified data $y_{ij}$, $i = 1, 2, \ldots, g$, $j = 1, 2, \ldots, m_i$; the typicality 
of unclassified data $x_j$, $j = 1, 2, \ldots, n$, is defined in the next section. 

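Let $\bar{y}_i$ and $S_i$ denote the sample mean and covariance of the labeled data 
$Y_i = \{y_{i1}, y_{i2}, \ldots, y_{im_i}\}$ for each mixture component $G_i$, $i = 1, 2, \ldots, g$, 
respectively. Then we have 

$$\bar{y}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} y_{ij}, \qquad (1)$$

$$S_i = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)(y_{ij} - \bar{y}_i)^T, \qquad (2)$$

for $i = 1, 2, \ldots, g$. 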

It can be proved that $S_i$ is a positive definite symmetric matrix. The Mahalanobis 
squared distance between two examples $y_{ij}$ and $\bar{y}_i$ with respect to $S_i$ is 
defined as [H] 

$$D(y_{ij}, \bar{y}_i; S_i) = (y_{ij} - \bar{y}_i)^T S_i^{-1} (y_{ij} - \bar{y}_i). \qquad (3)$$
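
As a concrete illustration of equations (1) to (3), the following pure-Python fragment computes the sample mean, the sample covariance, and the Mahalanobis squared distance for one component. It is a minimal sketch: the function names are hypothetical, the covariance divisor $m_i - 1$ is the usual unbiased choice and is an assumption here, and $S_i^{-1}$ is applied by solving a linear system rather than by forming the inverse.

```python
def sample_mean(points):
    """Equation (1): coordinate-wise mean of the labeled examples of one component."""
    m = len(points)
    return [sum(col) / m for col in zip(*points)]


def sample_covariance(points):
    """Equation (2): sample covariance matrix, divisor m - 1 (assumed)."""
    m = len(points)
    mean = sample_mean(points)
    p = len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for y in points:
        d = [yk - mk for yk, mk in zip(y, mean)]
        for r in range(p):
            for c in range(p):
                cov[r][c] += d[r] * d[c] / (m - 1)
    return cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def mahalanobis_sq(y, mean, cov):
    """Equation (3): (y - mean)^T cov^{-1} (y - mean)."""
    d = [yk - mk for yk, mk in zip(y, mean)]
    z = solve(cov, d)  # z = cov^{-1} d without forming the inverse
    return sum(dk * zk for dk, zk in zip(d, z))
```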



For a given labeled example $y_{ij}$, $i = 1, 2, \ldots, g$, $j = 1, 2, \ldots, m_i$, deleting $y_{ij}$ 
from the labeled data and recomputing $\bar{y}_i$ and $S_i$, we obtain the corresponding 
$\bar{y}_{i(ij)}$ and $S_{i(ij)}$, respectively. Consider the Mahalanobis squared distance 
between $y_{ij}$ and $\bar{y}_{i(ij)}$, that is, $D(y_{ij}, \bar{y}_{i(ij)}; S_{i(ij)})$, which can be computed 
according to equations (1), (2), and (3), and see whether $y_{ij}$ severely contaminates 
the estimates of the mean and covariance matrix of the subpopulation $G_i$. 

Let 

$$c(m_i, \nu_i) = \frac{(m_i - 1)\,\nu_i}{m_i\, p\, (\nu_i + p - 1)}, \qquad (4)$$

where $p$ is the dimensionality of the observed data and $\nu_i = m_i - g - p$. The function 

$$h(y_{ij}) = c(m_i, \nu_i)\, D(y_{ij}, \bar{y}_{i(ij)}; S_{i(ij)}) \qquad (5)$$
is distributed according to an $F$ distribution with $p$ numerator and $\nu_i$ denominator 
degrees of freedom [§]. It can be shown that [10] 



Thus, $h(y_{ij})$ can be easily computed since we only need to compute $\bar{y}_i$ and 
$S_i$ once, instead of recomputing $\bar{y}_{i(ij)}$ and $S_{i(ij)}$ for each $y_{ij}$. 



Fig. 1. $a_{ij}$ = the area to the right of the observed value $h(y_{ij})$ under the 
$F_{p,\nu_i}$ distribution 

Let $a_{ij}$ denote the area to the right of the observed value of $h(y_{ij})$ under the 
$F_{p,\nu_i}$ distribution, that is, $a_{ij}$ denotes the upper-tail probability of the $F_{p,\nu_i}$ 
distribution, shown in Fig. 1, 



which can be obtained by looking up the table of percentiles of the $F$ distribution. 
Then $a_{ij}$ can be used to assess whether $y_{ij}$ is an outlier. Generally, an 
identification of observations which appear to stand apart from the bulk of the 
data is undertaken with a view to taking some action to reduce their effect in 
the formation of any estimates. Although, in some situations, the atypical 
observations may be of much interest, our objective in this paper is to estimate 
the parameters of the data distribution under the normality hypothesis. Therefore, 
we wish to select only typical examples and remove the atypical examples. The 
basic idea is that $a_{ij}$ can be used as a metric of the typicality of examples. If $a_{ij}$ is close 
to zero, then $y_{ij}$ is regarded as an atypical example of $G_i$. Thus, for a given 
typicality threshold, say $\alpha$, if $a_{ij}$ is greater than or equal to $\alpha$, then $y_{ij}$ is said 
to be a typical example and could be selected; otherwise $y_{ij}$ is discarded. 




{vimr/p)D{yij,yi, S^) 



( 6 ) 




h(v.) 



$$a_{ij} = \Pr\{F_{p,\nu_i} > h(y_{ij})\} = \Pr\{F_{\nu_i,p} < 1/h(y_{ij})\} \qquad (7)$$
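
The tail area in equation (7) need not be read from a table. The following Python fragment is a minimal sketch that evaluates it numerically through the regularized incomplete beta function, using the standard identity $\Pr\{F_{d_1,d_2} > x\} = I_z(d_2/2, d_1/2)$ with $z = d_2/(d_2 + d_1 x)$. The function names are hypothetical, and the continued-fraction evaluation is a textbook numerical method chosen for this sketch, not something taken from the paper.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence, then odd step.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for a, b > 0."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(h, p, nu):
    """Area to the right of h under the F distribution with (p, nu) degrees of freedom."""
    if h <= 0.0:
        return 1.0
    return reg_inc_beta(nu / 2.0, p / 2.0, nu / (nu + p * h))


def is_typical(h, p, nu, alpha):
    """Selection rule from the text: typical when the tail area is at least alpha."""
    return f_upper_tail(h, p, nu) >= alpha
```

For $p = 2$ the tail has the closed form $(1 + 2h/\nu)^{-\nu/2}$, which gives a convenient check of the numerical routine.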



4 Typicality Measure of Unlabeled Examples 

For the unlabeled data, typicalities are easier to calculate since they are 
independent of the computation of the sample means and covariances of the 
mixture components. Before $y_{ij}$ is assessed, it is already known to have been generated from 
$G_i$. All we need to do is to assess how typical it is of $G_i$, while the mean and 
covariance of $G_i$ are estimated in terms of $y_{ij}$ only, $j = 1, 2, \ldots, m_i$. The typicality 
of an unlabeled example $x_j$, however, is assessed with respect to each component 
$G_i$, $i = 1, 2, \ldots, g$. If $x_j$ is atypical of all components, then it is said to be atypical; 
otherwise it is said to be typical of the component with respect to which $x_j$ has the 
highest typicality. 

Suppose that the mean, $\bar{y}_i$, of the component $G_i$ is calculated based on the 
labeled data as before in (1). For each unlabeled example $x_j$, the Mahalanobis 
squared distance between $x_j$ and $\bar{y}_i$, $i = 1, 2, \ldots, g$, can be defined as 

$$D(x_j, \bar{y}_i; S_i) = (x_j - \bar{y}_i)^T S_i^{-1} (x_j - \bar{y}_i). \qquad (8)$$

Similarly, the following function 

$$h(x_j) = c(m_i + 1, \nu_i + 1)\, D(x_j, \bar{y}_i; S_i) \qquad (9)$$

also has an $F_{p,\nu_i+1}$ distribution, where $\nu_i + 1 = m_i - p$ and the function $c(\cdot, \cdot)$ is 
defined by (4) as before. Since $x_j$ is independent of the computation of $\bar{y}_i$ and $S_i$, 
$h(x_j)$ is easier to compute than $h(y_{ij})$. Define $a_{ij}$ as the tail area to the right of the 
observed value of $h(x_j)$ under the $F_{p,\nu_i+1}$ distribution, analogously to the definition for 
labeled data $y_{ij}$. Thus $a_{ij}$ is viewed as the typicality measure of the observation 
$x_j$ with respect to the component $G_i$. The closer $a_{ij}$ is to zero, the more atypical 
$x_j$ is with respect to $G_i$. For each component $G_i$, $i = 1, 2, \ldots, g$, $a_{ij}$ is computed, 
and the maximum of $a_{1j}, a_{2j}, \ldots, a_{gj}$, denoted by $a_j$, can be viewed as the 
typicality measure of $x_j$ with respect to the mixture model $G$. Therefore, for the 
given typicality threshold $\alpha$, if 

$$a_j = \max_{i=1,2,\ldots,g} a_{ij} < \alpha, \qquad (10)$$

then $x_j$ is said to be an atypical example and could be discarded; otherwise $x_j$ 
is a typical example and could be selected. 
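
The computation in equations (9) and (10) can be sketched in a few lines for the special case $p = 2$, where the upper tail of the $F$ distribution with 2 and $\nu$ degrees of freedom has the closed form $(1 + 2h/\nu)^{-\nu/2}$. The fragment below is illustrative only. The function names are hypothetical, the Mahalanobis squared distances of equation (8) are assumed to be given, and $\nu_i$ is taken as $m_i - g - p$ as in the definition that accompanies equation (4).

```python
def coef_c(m, nu, p):
    """Coefficient of equation (4): (m - 1) nu / (m p (nu + p - 1))."""
    return (m - 1) * nu / (m * p * (nu + p - 1))


def f2_upper_tail(h, nu):
    """Upper tail of the F distribution with (2, nu) degrees of freedom (p = 2 only)."""
    return (1.0 + 2.0 * h / nu) ** (-nu / 2.0)


def unlabeled_typicality(distances, sizes, g, p=2):
    """Return (a_j, index of the best component) for one unlabeled example.

    distances[i] is D(x_j, ybar_i; S_i) from equation (8); sizes[i] is m_i.
    """
    if p != 2:
        raise ValueError("this sketch only covers the closed-form case p = 2")
    tails = []
    for d_sq, m_i in zip(distances, sizes):
        nu_i = m_i - g - p                       # assumed definition of nu_i
        h = coef_c(m_i + 1, nu_i + 1, p) * d_sq  # equation (9)
        tails.append(f2_upper_tail(h, nu_i + 1))
    a_j = max(tails)                             # equation (10)
    return a_j, tails.index(a_j)
```

An example that is close to the first component and far from the second is judged against the first: its typicality $a_j$ is the larger of the two tail areas, and it is kept when $a_j$ is at least the threshold $\alpha$.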

As it is pointed out in PI, the value of the typicality threshold $\alpha$ depends 
on how the presence of apparently atypical observations is handled. In our case, 
the aim is to delete all atypical observations and sample only the typical 
observations to reduce the data size for learning the classifiers, so $\alpha$ might be set at 
a conventional level, say $\alpha = 0.1$ or $0.05$. Alternatively, $\alpha$ might be set at $0.01$ or $0.005$ to 
eliminate only those observations assessed as being extremely atypical. 

5 Algorithm for Selecting Typical Examples 

Once the typicalities of labeled and unlabeled data are obtained, it is simple 
to select typical examples. The proposed algorithm for selecting typical examples 
is described as follows. Algorithm SelExample, shown in Fig. 2, is the main 
program, whose input consists of the labeled and unlabeled data set $D$, the 
typicality threshold, and the capacity of the main memory. The algorithm outputs a 
sample of $D$ which consists of all typical labeled examples and those typical 
unlabeled examples randomly drawn without exceeding the capacity of main memory. 
It does so by calling two procedures, SelLabeled and SelUnLabeled, shown in Figs. 3 and 4, 
respectively. 

In algorithm SelExample, lines 2 through 4 compute the sample mean and 
covariance of component $G_i$ according to equations (1) and (2), respectively. 
Line 5 calls procedure SelLabeled to select all typical labeled examples, where 
$\bar{y} = \langle \bar{y}_1, \bar{y}_2, \ldots, \bar{y}_g \rangle$ and $S = \langle S_1, S_2, \ldots, S_g \rangle$ are the sample mean vector and 
covariance vector, respectively. Lines 6 and 7 call procedure SelUnLabeled to 
randomly draw unlabeled examples and select typical ones. Finally, the algorithm 
outputs the sample $D_0 = \{X_0, Y_0\}$ in line 8, where $X_0$ contains the typical 
unlabeled examples, while $Y_0$ encompasses the typical labeled examples. 



Algorithm SelExample for selecting typical examples
Input:  Data set D = {X, Y}, with labeled set Y of size m
        and unlabeled set X of size n;
        typicality threshold α; memory size N > m
Output: A sample D_0 = {X_0, Y_0} ⊂ D

1 begin
2   for each mixture component G_i, i = 1, 2, ..., g do
3     compute sample mean ȳ_i and covariance S_i;
4   end
5   Y_0 = SelLabeled(ȳ, S, α);              // select typical labeled examples
6   m_0 = |Y_0|;
7   X_0 = SelUnLabeled(ȳ, S, α, N − m_0);   // select typical unlabeled examples
8   Output D_0 = {X_0, Y_0};
9 end

Fig. 2. The Algorithm for Selecting Typical Examples



Procedure SelLabeled, shown in Fig. 3, is used to select typical labeled
examples. Initially, the output labeled sample Y_0 is empty. Lines 3 through 12
select all typical labeled examples. Line 4 calculates the degrees of freedom
ν_i and the coefficient c_i for G_i. For each labeled example
y_ij, the Mahalanobis squared distance between it and the sample mean ȳ_i and
the function h are found from their defining equations in lines 6 and 7,
respectively. In lines 8 and 9, the tail area of the
distribution of the observed value y_ij is computed, and if the tail area is greater
than the typicality threshold, then y_ij is sampled.

Procedure SelUnLabeled, shown in Fig. 4, is used to select typical unlabeled
examples. X_0 is initialized to be empty. Lines 3 through 16 select all typical
unlabeled examples. Line 4 randomly draws an unlabeled example x_j, and lines
6 through 12 find the maximum of the tail areas with respect to x_j for all
components. As in procedure SelLabeled, line 7 computes the degrees of freedom
ν_i and the coefficient c_i. Lines 8 through 10 are similar to lines 6 through 8 of the
procedure SelLabeled, but the computations are with respect to x_j instead of
y_ij. Line 8 computes the distance between x_j and ȳ_i,
and line 10 finds the tail area of the F distribution of x_j with p numerator and
ν_i denominator degrees of freedom. In lines 13 and 14, if the typicality of x_j
is less than or equal to the typicality threshold, then x_j is discarded, and the
algorithm proceeds to draw the next example at random for selection. Otherwise,
x_j is added to the output sample. When the memory capacity is exceeded, or
the entire data set has been scanned, the procedure stops.

Procedure SelLabeled(ȳ, S, α)
1  begin
2    initialization: set Y_0 empty;
3    for each subset of labeled data Y_i, i = 1, 2, ..., g, do
4      ν_i = m_i − g − p; c_i = c(m_i, ν_i);
5      for each y_ij ∈ Y_i, j = 1, 2, ..., m_i, do
6        compute the distance of y_ij from ȳ_i
7        d_ij = D(y_ij, ȳ_i; S_i); h_ij = h(y_ij);
8        compute a = P(F_{p,ν_i} > h_ij) = P(F_{ν_i,p} < 1/h_ij);
9        if a > α then Y_0 ⇐ Y_0 ∪ {y_ij};
10     end
11   end
12   return Y_0;
13 end

Fig. 3. The Procedure for Selecting Typical Labeled Examples
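
The draw-and-select loop of SelUnLabeled can be sketched in Python as below. This is a simplified illustration under stated assumptions: the F-distribution tail area is replaced by a caller-supplied function `typicality(x, i)`, examples are visited in a random order without replacement, and the names `sel_unlabeled`, `toy_typicality` and `centres` are hypothetical.

```python
import random


def sel_unlabeled(unlabeled, typicality, g, alpha, capacity, seed=0):
    """Draw unlabeled examples in random order and keep those whose best
    per-component typicality exceeds alpha, up to `capacity` examples."""
    rng = random.Random(seed)
    order = list(range(len(unlabeled)))
    rng.shuffle(order)                    # each example is visited at most once
    selected = []
    for j in order:
        if len(selected) >= capacity:     # memory budget reached
            break
        x = unlabeled[j]
        a = max(typicality(x, i) for i in range(g))
        if a > alpha:                     # otherwise the example is discarded
            selected.append(x)
    return selected


# Toy stand-in for the tail area: typicality decays with the distance from
# hypothetical one-dimensional component centres.
centres = [0.0, 10.0]


def toy_typicality(x, i):
    return 1.0 / (1.0 + (x - centres[i]) ** 2)


data = [0.1, 9.8, 5.0, 0.4, 25.0, 10.3]
sample = sel_unlabeled(data, toy_typicality, g=2, alpha=0.5, capacity=3)
```

Because every example is inspected at most once, the sketch reflects the single-scan property claimed for the procedure.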

In the main algorithm SelExample, the for loop from line 2 to line 4 runs g
times, and in each iteration the run time complexity is O(p² m_i), i = 1, 2, ..., g, so the
total time for computing the sample means and covariances is
O(p² Σ_{i=1}^{g} m_i) = O(mp²). The procedure SelLabeled computes typicalities for all m
labeled examples, and each needs time O(p²), so its time complexity is also O(mp²). In
the procedure SelUnLabeled, the for loop from line 6 to line 12 takes time O(p² g),
and it executes at most n times (in the worst case all unlabeled examples are
randomly drawn and verified); therefore the time for selecting typical unlabeled
examples is O(gnp²). Thus, the total run time is O(gnp²) since m << n. On the
other hand, each unlabeled example only needs to be accessed once. Although
the labeled examples need to be accessed at least twice, they can be explored
in the main memory before the two procedures are called, since the labeled data
set is small. Hence, the entire original data set needs to be scanned only once.

Procedure SelUnLabeled(ȳ, S, α, N)
1  begin
2    initialization: set X_0 empty;
3    loop
4      randomly draw an unlabeled example x_j ∈ X, 1 ≤ j ≤ n;
5      a = 0;
6      for each mixture component G_i, i = 1, 2, ..., g, do
7        ν_i = m_i − p; c_i = c(m_i + 1, ν_i);
8        compute the distance of x_j from ȳ_i
9        d_ij = D(x_j, ȳ_i; S_i); h_ij = c_i d_ij;
10       compute a_0 = P(F_{p,ν_i} > h_ij) = P(F_{ν_i,p} < 1/h_ij);
11       if a_0 > a then a = a_0;
12     end
13     if a ≤ α then goto loop
14     X_0 ⇐ X_0 ∪ {x_j};
15     if |X_0| ≥ N then exit loop
16   end
17   return X_0;
18 end

Fig. 4. The Procedure for Selecting Typical Unlabeled Examples

6 Concluding Remarks

We propose an approach to selecting typical examples for learning classifiers with
a small set of labeled training examples and a very large set of unlabeled training
examples. Our basic idea is to estimate the sample means and covariances of
the labeled examples, compute the Mahalanobis squared distance between each
training example and the sample means, and verify whether a training example
is far from the bulk of the data. The classifier can be learned by estimating
the parameters of the data distribution functions under the assumption of a Gaussian
mixture model via the EM algorithm. Using this approach, the entire data set is
scanned just once regardless of its size, and all computations can be completed
in main memory. We are applying this approach to an actual application and
the result will be reported subsequently.

Two points should be emphasized. First, the number of labeled examples
for each class (component) G_i must be greater than the dimensionality of the
data space, which means m_i > p, i = 1, 2, ..., g, because the sample means are
computed based on the labeled examples. Otherwise, the sample means and
covariances may have large deviations, and the typicality estimate may be far from
the reality. Second, the labeled examples must be independent of the mixture
distribution of the data set, that is, the m_i are independent of the component
proportions π_i, for i = 1, 2, ..., g. This assumption ensures that the labeled examples
provide no information about the prior probability of each class, which is the
requirement for finding the maximum likelihood estimate via the EM algorithm.

This approach can be extended to apply to other mixture models with, say,
exponential components. In that case, the typicality measure of instances may
need a different but similar definition.

Abstract. Controllers based on artificial neural networks (ANNs) are
demonstrating high potential in the nonconventional control of nonlinear 
dynamic systems. This work presents a comparative study of neuro-controllers 
based on different topologies of ANN, namely the feedforward neural network 
(FNN) using a generalized weight adaptation algorithm, and the diagonal 
recurrent neural network (DRNN) using a generalized dynamic back- 
propagation algorithm. Also, both on-line and off-line training methodologies 
of the adopted ANNs are investigated. The study is based on controlling the 
system using a coordination of feedforward controllers combined with inverse 
system dynamics identification. Simulation results are used to verify the effect 
of varying the topology of the ANNs and training methods on the control and 
system performance of nonlinear dynamic systems. 



1 Introduction 

Conventional controller design usually involves complex mathematical analysis and 
yet has many difficulties in controlling highly nonlinear plants. To overcome some of 
these difficulties, a number of approaches using neural networks for control have been 
proposed in recent years. The use of artificial neural networks (ANN) helps the 
controller design to be rather flexible, especially when plant dynamics are complex 
and highly nonlinear. A widely used scheme for the use of ANNs in nonlinear system 
control is to identify the inverse dynamics of an unknown plant and then apply the 
identification model as a feedforward controller [3], [6]. 

ANNs can be classified into one of three classes based on their feedback link
connection structure [8-10]: recurrent (global feedback connections, e.g., dynamic
recurrent neural networks [4]), locally recurrent (local feedback connections, e.g.,
Diagonal Recurrent Neural Network (DRNN) [2]), and nonrecurrent (no feedback
connections, e.g., feedforward neural network [1]).

Tanomaru and Omatu [5] addressed the question of how to perform on-line training of
multilayer neural controllers in an efficient way in order to reduce the training time.
Kuschewski, Hui, and Zak [1] presented methods for identification and control of a
nonlinear dynamic system based on feedforward neural networks (FNNs) using a
generalized weight adaptation algorithm. They used the identified model of the
inverse dynamics of a system as a part of a feedforward controller combined with a
compensator to control such a nonlinear system.

Park, Choi, and Lee [7] used a multilayer neural network to design an optimal 
tracking neuro-controller for discrete-time nonlinear dynamic systems. The controller 
consists of two parts : the first part is a feedforward controller to track the steady-state 
output of the plant. The design of this controller used a novel inverse mapping concept 
for off-line training of a neuro-identifier. The second part is a feedback controller to 
control the transient output of the plant. The design of this controller used an NN
trained with a generalized back-propagation-through-time algorithm that minimizes a
quadratic cost function.

In this paper, both an FNN trained with a generalized weight adaptation algorithm and
a DRNN trained with a dynamic back-propagation (DBP) algorithm are used for inverse
dynamics identification of a nonlinear dynamic system. Next, a conventional control
method that uses feedforward control combined with a compensator is illustrated.
Using the idea of placing the inverse plant dynamics in the feedforward control
path, both an FNN with on-line training and a DRNN with on-line and off-line training
are applied to investigate the effect of using different topologies for inverse
identification on system performance.

The remainder of this paper is organized as follows: In section 2, the adopted ANN
architectures for the inverse identification are described. In section 3, the theory of the 
coordination of feedforward control method is briefly discussed. In section 4, the 
models of ANN based controllers with both on-line and off-line training are presented. 
In section 5, case studies and simulation results are given to investigate the effect of 
both adopted ANN topologies and training methodologies on system performance. 
The conclusions of the paper are given in section 6. 



2 The Adopted ANN Architectures 

Here, two neural network topologies with their adaptation algorithms are briefly 
reviewed. 

2.1 A FNN Topology Trained with Generalized Weight Adaptation Algorithm 

A block diagram of a two-layer FNN is shown in Fig. 1. The elements of such a neural
network are given in [1]. The weight matrix update algorithm is given by [1]:

$$W_{i,k+1} = W_{i,k} + \Delta W_{i,k}, \quad i = 1, 2 \qquad (1)$$

where

$$\Delta W_{1,k} = \begin{cases} -\dfrac{2\, z_{1,k}\, \Theta_1^T(x)}{\Theta_1^T(x)\, x} & \text{if } \Theta_1^T(x)\, x \neq 0 \\[2ex] 0 & \text{if } \Theta_1^T(x)\, x = 0 \end{cases} \qquad (2)$$

and ΔW_{2,k} has the same quotient form, with denominator Θ_2^T(y_{1,k}) y_{1,k}, and is
set to zero if Θ_2^T(y_{1,k}) y_{1,k} = 0 (3); its full expression is given in [1].

Here, Θ_1 and Θ_2 are operators of the form

$$\Theta(x) = [\,\mathrm{sgn}(x_1),\ \mathrm{sgn}(x_2),\ \ldots,\ \mathrm{sgn}(x_n)\,]^T$$

Let e_k be the FNN error on the kth iteration, e_k = y_d − y_k, where y_d is the desired
output of the ANN, and y_k is the output at time index k. It can be shown that [1]

$$e_{k+1} = (I_{n_2} - \Lambda)\, e_k \qquad (4)$$

where I_{n_2} is the n_2 × n_2 identity matrix.

Thus, for the error to converge to zero, the eigenvalues of the matrix (I_{n_2} − Λ) must lie
within the unit circle. Λ can be chosen as a diagonal matrix.
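
The convergence condition attached to (4) can be checked numerically. The following Python sketch is only an illustration, assuming a diagonal Λ so that the eigenvalues of (I − Λ) are simply 1 − λ_i; the function names and the example values are hypothetical.

```python
def error_trajectory(e0, lam_diag, steps):
    """Iterate e_{k+1} = (I - Lambda) e_k for a diagonal Lambda."""
    e = list(e0)
    history = [e]
    for _ in range(steps):
        e = [(1.0 - lam) * ei for lam, ei in zip(lam_diag, e)]
        history.append(e)
    return history


def converges(lam_diag):
    """For diagonal Lambda the eigenvalues of I - Lambda are 1 - lambda_i,
    so the error decays when each of them lies inside the unit circle."""
    return all(abs(1.0 - lam) < 1.0 for lam in lam_diag)


history = error_trajectory(e0=[1.0, -2.0], lam_diag=[0.5, 1.5], steps=20)
```

Under this assumption, any diagonal entry strictly between 0 and 2 gives a decaying error component, and entries close to 0 give slow decay.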

2.2 A DRNN Topology Trained with a Dynamic Back-Propagation Algorithm 




Fig. 2. Diagonal recurrent neural network.

A block diagram of the DRNN is shown in Fig. 2, where for each discrete time k, I_i(k)
is the ith input, S_j(k) is the sum of inputs to the jth recurrent neuron, X_j(k) is the
output of the jth recurrent neuron, and y(k) is the output of the network. W^I is the
input weight matrix, and W^D and W^O are the diagonal and output weight vectors,
respectively, each with one entry per recurrent neuron. The mathematical model for the DRNN
shown in Fig. 2 is

$$y(k) = \sum_{j=1}^{n_d} W_j^O X_j(k), \qquad X_j(k) = f(S_j(k)) \qquad (5)$$

$$S_j(k) = W_j^D X_j(k-1) + \sum_{i=1}^{n_i} W_{ij}^I I_i(k) \qquad (6)$$
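
One time step of the model in (5) and (6) can be sketched in Python as follows. This is a minimal illustration, assuming a logistic sigmoid for the activation f and plain nested lists for the weights; the function name `drnn_step` and its argument layout are hypothetical.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def drnn_step(inputs, x_prev, w_in, w_diag, w_out):
    """One time step of (5) and (6).

    inputs : list of n_i input values I_i(k)
    x_prev : list of n_d hidden outputs X_j(k-1)
    w_in   : n_i x n_d nested list, w_in[i][j] = W^I_ij
    w_diag : list of n_d recurrent weights W^D_j
    w_out  : list of n_d output weights W^O_j
    """
    n_d = len(w_diag)
    s = [w_diag[j] * x_prev[j]
         + sum(w_in[i][j] * inputs[i] for i in range(len(inputs)))
         for j in range(n_d)]                        # equation (6)
    x = [sigmoid(sj) for sj in s]                    # X_j(k) = f(S_j(k))
    y = sum(w_out[j] * x[j] for j in range(n_d))     # equation (5)
    return y, x


# One neuron, one input: S = 4.0 * 0.5 + 2.0 * 1.0 = 4.0
y, x = drnn_step([1.0], [0.5], [[2.0]], [4.0], [3.0])
```

The returned list `x` is fed back as `x_prev` on the next call, which is the only recurrence in the diagonal topology.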



3 Theory of the Coordination of Feedforward Control Method 

In this section, a conventional control method (the coordination of feedforward control
method) is presented [1] for a nonlinear dynamic system, as shown in Fig. 3.

3.1 Coordination of Feedforward Control Method (CFCM)

A block diagram of the conventional CFCM controller is shown in Fig. 3.

Fig. 3. The block diagram of the CFCM controller.

The controller design equations are

$$PA = B, \qquad PG = T_l\,(I - T_l)^{-1}$$

and, if P is invertible,

$$A = P^{-1} B \qquad (7)$$

$$G = P^{-1}\, T_l\,(I - T_l)^{-1} \qquad (8)$$

Eqn. (7) is called the synthesis equation, and Eqn. (8) is the design equation. The role
of the synthesis and design equations in the CFCM controller is illustrated in [1].
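
As a small worked instance of (7) and (8), the sketch below solves the two design equations for scalar quantities. This is an assumption made purely for illustration: in the method itself P, B and T_l describe a dynamic system, whereas here they are treated as hypothetical static scalar gains, and the function name `cfcm_gains` is not from the source.

```python
from fractions import Fraction


def cfcm_gains(p, b, t):
    """Solve P*A = B and P*G = T/(1 - T) for scalar P, B, T."""
    if p == 0 or t == 1:
        raise ValueError("P must be invertible and T must differ from 1")
    a = b / p                    # synthesis equation (7)
    g = t / (p * (1 - t))        # design equation (8)
    return a, g


a, g = cfcm_gains(Fraction(2), Fraction(3), Fraction(1, 4))
```

Exact fractions are used so that substituting the results back into PA = B and PG = T_l(I − T_l)^{-1} can be checked without rounding error.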



4 Coordination of Feedforward Control Method Based on ANN 

In this section, two neural-based controller models are described. One model is based
on an on-line training methodology presented in [1], while the other is based on an
off-line training methodology for inverse dynamics identification of a nonlinear
dynamical system.

4.1 ANN Based Controller with On-Line Training

The model of an ANN based controller, which was developed in [1], is shown in Fig. 4.
Using an ANN within the CFCM controller is motivated by the
observation that the synthesis and design equations are used. Here, an ANN is
used to identify the inverse dynamics of a nonlinear system. The ANN controller
is designed to have a stable feedback loop, using a linear full-order estimator combined
with a linear control law, as shown in Fig. 4.

Fig. 4. ANN based controller with on-line training.

The state-space realization of the linear stabilizing compensator is

$$\dot{z} = (A - LC)\,z - Lq + Bv, \qquad v = -Kz \qquad (9)$$

where A, B, and C are the parameters of the linearized plant, and q is the compensator
input, with an output v.

4.2 ANN Based Controller with Off-Line Training 

The control method in this section is useful as an off-line control design method where
the plant is first identified and then a controller is designed for it. This assumes that
the plant can be identified without making the plant unstable [7]. A model of an ANN
based controller, shown in Fig. 5, is developed; it is based on the identification
of the inverse dynamics of the plant using an off-line training method of a DRNN.

The estimator/controller compensator is the same as presented in section 4.1. The
feedforward neuro-controller is constructed to generate the feedforward control input
corresponding to the set point, and is trained by the dynamic back-propagation
algorithm presented in section 4.2.2. The controlled plant is first identified by a
neuro-identifier using the dynamic back-propagation (DBP) algorithm. The identifier is
then trained to estimate the sensitivity information of the plant needed by the feedforward
controller during training.

Mahmoud F. Hussin, Badr M. Abouelnasr, and Amin A. Shoukry




Fig. 5. ANN based controlled system designed with off-line training.



4.2.1 The Neuro-Identifier 

The role of the neuro-identifier is to emulate the plant dynamics. It is then used to
provide the sensitivity information of the plant to the feedforward neuro-controller.
Training of the neuro-identifier can be regarded as the approximation of a
nonlinear input-output mapping from a given data set using supervised
learning. The process of training the neuro-identifier is shown in Fig. 6. Input-output
training patterns are obtained from the operation history of the plant. Let y_ni(k) and
y(k) be the neuro-identifier output and the actual output of the plant at time k; then an error
function can be defined as:

$$E_{ni}(k) = \tfrac{1}{2}\,\bigl(y(k) - y_{ni}(k)\bigr)^2$$

After the learning process, the plant characteristics are stored as the weight parameters
of the neuro-identifier.

The training process is stopped when the average error (1/N) Σ_{k=1}^{N} E_ni(k) converges
to a small value.

Fig. 6. Schematic training of the neuro-identifier.

4.2.2 Feedforward Neuro-Controller with DBP Training Algorithm

The feedforward neural controller (FFNC) is trained to learn the inverse dynamics of
the nonlinear plant. The FFNC network can be developed by using the neuro-identifier
as shown in Fig. 7, where u_ff is the feedforward control input and y_ni is the output of
the neuro-identifier. The training process of the FFNC adjusts the weight
parameters so that the output of the neuro-identifier y_ni approximates the given
reference output y_ref; when the training is finished, u_ff approximates u_ref.

Fig. 7. Training of feedforward neuro-controller.

To train the FFNC, sensitivity information is needed, which is obtained from the
neuro-identifier that has already been trained. The DBP algorithm for the DRNN [2] is
used to train the FFNC. An error function for a training cycle of the
feedforward neural controller can be defined as:

$$E_{FFNC}(k) = \tfrac{1}{2}\,\bigl(y_{ref}(k) - y_{ni}(k)\bigr)^2 \qquad (10)$$

The gradient of the error in (10) with respect to an arbitrary weight vector W is
given as

$$\frac{\partial E_{FFNC}(k)}{\partial W} = -e_{FFNC}(k)\,\frac{\partial y_{ni}(k)}{\partial W} = -e_{FFNC}(k)\,\frac{\partial y_{ni}(k)}{\partial u_{ff}(k)}\,\frac{\partial u_{ff}(k)}{\partial W} = -e_{FFNC}(k)\; y_{uff}(k)\,\frac{\partial u_{ff}(k)}{\partial W} \qquad (11)$$

where e_FFNC(k) = y_ref(k) − y_ni(k) is the instantaneous error between the desired output and
the output of the neuro-identifier, and the factor y_uff(k) = ∂y_ni(k)/∂u_ff(k) represents
the sensitivity of the neuro-identifier with respect to its input.

The sensitivity function y_uff(k) can be computed by using the neuro-identifier, as
follows:

$$y_{uff}(k) = \frac{\partial y_{ni}(k)}{\partial u_{ff}(k)} \qquad (12)$$

where u_ff(k) is an input to the neuro-identifier. By applying the chain rule to (12),

$$\frac{\partial y_{ni}(k)}{\partial u_{ff}(k)} = \sum_j \frac{\partial y_{ni}(k)}{\partial X_j(k)}\,\frac{\partial X_j(k)}{\partial u_{ff}(k)} = \sum_j W_j^O\,\frac{\partial X_j(k)}{\partial u_{ff}(k)} \qquad (13)$$

Also from (5),

$$\frac{\partial X_j(k)}{\partial u_{ff}(k)} = f'(S_j(k))\,\frac{\partial S_j(k)}{\partial u_{ff}(k)} \qquad (14)$$

Since the inputs to the neuro-identifier are u_ff(k), u_ff(k−1), and y_ref(k), eq. (6) becomes

$$S_j(k) = W_j^D X_j(k-1) + W_{1j}^I\, u_{ff}(k) + W_{2j}^I\, u_{ff}(k-1) + W_{3j}^I\, y_{ref}(k) + W_{4j}^I\, b_I \qquad (15)$$

Thus

$$\frac{\partial S_j(k)}{\partial u_{ff}(k)} = W_{1j}^I \qquad (16)$$

From (13), (14), and (16),

$$\frac{\partial y_{ni}(k)}{\partial u_{ff}(k)} = \sum_j W_j^O\, f'(S_j(k))\, W_{1j}^I \qquad (17)$$

Using the negative gradients in (11), the weights of the feedforward controller can be
adjusted.
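
The closed form in (17) can be compared against a finite-difference estimate. The Python sketch below is an illustration under stated assumptions: f is taken to be the logistic sigmoid, the bias term of (15) is omitted, and all weights and input values are hypothetical.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def pre_activations(u_ff, u_prev, y_ref, x_prev, w_in, w_diag):
    """S_j(k) as in (15), without the bias term."""
    inputs = [u_ff, u_prev, y_ref]
    return [w_diag[j] * x_prev[j] + sum(w_in[i][j] * inputs[i] for i in range(3))
            for j in range(len(w_diag))]


def identifier_output(u_ff, u_prev, y_ref, x_prev, w_in, w_diag, w_out):
    s = pre_activations(u_ff, u_prev, y_ref, x_prev, w_in, w_diag)
    return sum(w * sigmoid(sj) for w, sj in zip(w_out, s))


def sensitivity(u_ff, u_prev, y_ref, x_prev, w_in, w_diag, w_out):
    """Equation (17): sum_j W^O_j f'(S_j(k)) W^I_1j, with f'(s) = f(s)(1 - f(s))."""
    s = pre_activations(u_ff, u_prev, y_ref, x_prev, w_in, w_diag)
    return sum(w_out[j] * sigmoid(s[j]) * (1.0 - sigmoid(s[j])) * w_in[0][j]
               for j in range(len(w_diag)))


# Hypothetical weights for an identifier with two recurrent neurons.
w_in = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
w_diag = [0.6, -0.1]
w_out = [1.2, -0.7]
rest = (0.1, 0.4, [0.5, 0.5], w_in, w_diag, w_out)   # u_ff(k-1), y_ref(k), X(k-1), weights

analytic = sensitivity(0.8, *rest)
h = 1e-6
numeric = (identifier_output(0.8 + h, *rest) - identifier_output(0.8 - h, *rest)) / (2 * h)
```

Agreement between `analytic` and `numeric` reflects that X_j(k−1) does not depend on u_ff(k), which is the step that reduces (16) to a single input weight.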



5 Case Study and Simulation Results 

5.1 Performance Index 

To quantify how well a neural controller performs its task, a performance index (PI) is
used:



PI = ^ 

n 

where e_k is the instantaneous error at time k, and k = 1, 2, ..., n.

5.2 Case Study : An Inverted Pendulum Controlled by a dc Motor 

An inverted pendulum controlled by a dc motor via a gear train is chosen as the
plant. Its state space equations, CFCM controller, and compensator design are also
described in [1].




Comparative Study of Neural Network Controllers for Nonlinear Dynamic Systems 






5.2.1 The Conventional CFCM Simulation Results 



We now use the CFCM controller described in section 3 for the nonlinear
dynamic system given in Eqn. (18), with initial conditions x_1(0) = 1.0, x_2(0) =
0.0, and x_3(0) = 0.0; the initial conditions on all the other blocks are zero. The
reference signal r(t) is chosen piecewise as follows:

    r(t) = 0.2t                     for 0 < t < 10
    r(t) = 2                        for 10 < t < 15
    r(t) = -0.4t + 8                for 15 < t < 20
    r(t) = 0                        for 20 < t < 25
    r(t) = -sin((π/10)(t - 25))     for t > 25, over the sine segment
    r(t) = 0                        thereafter

Fig. 8. The CFCM Controller Performance.

The simulation results are shown in Fig. 8, which shows unacceptable tracking error
performance, with PI = 12.562.

5.2.2 Simulation Results of ANN Based Controller with On-Line Inverse 
Identification 

Now, the effects of applying different topologies of the ANN controllers are 
investigated. 

5.2.2.1 Using a Two-Layer-FNN for On-Line Inverse Identification 

Let the initial conditions be the same as in [1], and use a 3×3×1 ANN for inverse
identification, with Γ = Θ_1 = Θ_2 = sgn(·), where

    sgn(x) = +1 if x ≥ 0,
             -1 if x < 0

and all initial weights are small random numbers in [0, 1], with error reduction function
Λ = [0.029]. Fig. 9 illustrates the simulation results at a sampling rate T = 0.04. It
shows an improved performance of 88.26% relative to the conventional CFCM.

5.2.2.2 Using A DRNN for On-Line Inverse Identification 

The effect of using a DRNN topology is investigated. The initial conditions are taken as
before, and the DRNN is 3×7×1. The learning rate η for each layer is
0.0084. Fig. 10 illustrates the simulation results using T = 0.04. An improved
performance of 32.47% relative to the two-layer FNN is achieved.

Fig. 9. Performance of CFCM with FNN for P^{-1}.

Fig. 10. Performance of CFCM with DRNN for P^{-1}.



5.2.3 The ANN Based Controller with Off-Line Inverse Identification 

Here, we investigate the effect of an off-line training method using the DRNN
topology. The initial conditions for the plant and the compensator are taken as in the
previous cases.

5.2.3.1 Training of the Neuro-Identifier

The neuro-identifier network consists of a 3×9×1 DRNN. The learning rates for the
input, hidden, and output layers are 0.006, 0.002, and 0.04, respectively. The training
patterns of the neuro-identifier are generated by solving the mathematical model of the
plant with random initial values in [0, 1] and a unit step input. Fig. 11 illustrates the
simulation results of the neuro-identifier.

5.2.3.2 Training of the Feedforward Controller

The architecture of the feedforward controller network is a 3×7×1 DRNN, and the
learning rates for the input, hidden, and output layers are 0.0015, 0.0021, and 0.044,
respectively. By using the DBP algorithm, which was defined in section 4.2.2, and
different sets of reference outputs as training patterns for the feedforward controller,
the simulation results shown in Fig. 12 are obtained.





Fig. 11. Simulation Results of the Neuro-Identifier.

Fig. 12. Performance of training the Feedforward Neuro-Controller.

5.2.3.3 Applying Controller

Now, the effect of applying the DRNN topology with off-line training is
investigated. After training both the neuro-identifier and the feedforward
neuro-controller, the controller is applied to the third-order nonlinear dynamical system
stated in section 5.2. Fig. 13 shows an improved performance of 97.48% relative to the
DRNN with on-line training.

5.2.4 Results 



The simulation results in sections 5.2.2 and 5.2.3 show that the DRNN
topology gives better performance than the two-layer FNN for an ANN based controller,
with both on-line and off-line inverse identification, as shown in Table 1.



Table 1. Performance index of different controllers

Controller                                             PI
CFCM                                                   12.562
A two-layer FNN with on-line inverse identification    1.475
A DRNN with on-line inverse identification             0.996
A DRNN with off-line inverse identification            0.025

Fig. 13. Performance of CFCM with DRNN for P^{-1}, and off-line training.
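
The improvement percentages quoted in sections 5.2.2 and 5.2.3 can be related to the PI values of Table 1 with a short calculation. The Python sketch below assumes that each percentage is the relative reduction of PI with respect to the preceding controller in the table; the function name `improvement` is hypothetical.

```python
def improvement(pi_before, pi_after):
    """Relative reduction of the performance index, in percent."""
    return 100.0 * (pi_before - pi_after) / pi_before


pi = {"CFCM": 12.562, "FNN on-line": 1.475,
      "DRNN on-line": 0.996, "DRNN off-line": 0.025}

fnn_vs_cfcm = improvement(pi["CFCM"], pi["FNN on-line"])
drnn_vs_fnn = improvement(pi["FNN on-line"], pi["DRNN on-line"])
offline_vs_online = improvement(pi["DRNN on-line"], pi["DRNN off-line"])
```

Under this reading, the first two values round to the reported 88.26% and 32.47%, and the third agrees with the reported 97.48% to within the rounding of the tabulated PI values.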



6 Conclusions 

The idea of using a CFCM, proposed by Zak, with an FNNM within
the CFCM structure is extended in this paper to use a DRNN instead of the
FNNM. This has led to an improvement of about 30% in the performance of the overall
controlled system (for case study I). Also, the training methods of the NNs used as
controllers were studied, showing that off-line training of the neurocontroller gives
an improved performance of about 70% (for the broom-stick problem) as compared to
on-line training. In general, we conclude that the use of the DRNN is promising, as its
architecture is more suitable for emulating dynamic systems. Also, off-line training with
enough training patterns can be used to build neural-based controllers with good
performance.

Abstract. This paper introduces a problem solving method involving
independent agents and a set of partial solutions. In the Iterative Multi-Agent
(IMA) method, each agent knows about a subset of the whole problem and cannot
solve it all by itself. An agent picks a partial solution from the set and then
applies its knowledge of the problem to bring that partial solution closer to a 
total solution. This implies that the problem should be composed of sub- 
problems, which can be attended to and solved independently. When a real- 
world problem shows these characteristics, then the design and implementation 
of an application to solve it using this method is straightforward. The solution 
to each sub-problem can affect the solutions to other sub-problems, and make 
them invalid or undesirable in some way, so the agents keep checking a partial 
solution even if they have already worked on it. The paper gives an example of 
constraint satisfaction problem solving, and shows that the IMA method is 
highly parallel and is able to tolerate hardware and software faults. 
Considerable improvements in search speed have been observed in solving the 
example constraint satisfaction problem. 



1 Introduction 

Encoding a real-world problem to be solved by a computer is often time consuming, 
as it requires a careful mapping process. If the representation inside the computer 
matches the original problem domain, then the design and implementation can be 
done faster, and there will be fewer bugs to deal with. Many complex problems are 
composed of different sub-problems, and solving them requires the collaboration of 
experts from different fields. One example is designing an airplane. The sub-problems 
usually interact with each other, and a change in one place can invalidate the solution 
to another part. This can happen in unpredictable ways. This form of emergence is 
usually considered a nuisance in engineering fields, as it makes it harder to predict the 
behavior of the whole system. Handling this emerging interaction is not trivial for big 
problems, and is worse when the problem domain is vague and there is limited or no 
theoretical knowledge about the possible interactions between the sub-problems. 

In this paper we propose to use the Iterative Multi-Agent (IMA) method for solving
complex problems. Section 2 explains the IMA method in general and outlines the
basic principles. Section 3 presents a practical application of the IMA method to solve
a computationally demanding Constraint Satisfaction Problem. Section 4 concludes
the paper and outlines some of the future directions of this work.



2 Segmenting a Complex Problem 

The IMA method is a decentralized method involving the use of multiple agents, in 
the form of software processes, to solve complex search problems. Each agent has 
knowledge relevant to part of the problem, and is given responsibility for solving one 
or more sub-problems. The agents do not collaborate directly, and the final solution to 
the problem emerges as a result of each agent's local actions. Overlap is allowed in 
the agents' knowledge of the problem and in their assigned sub-problems. 

Often a problem P can be expressed in terms of n sub-problems, P_1 to P_n, such that
P = ∪ P_i. Each sub-problem P_i can have a finite or infinite number of candidate
solutions S_i = {s_i1, s_i2, ...}, some of which may be wrong or undesirable. Each partial
solution of the whole problem contains a candidate solution for each sub-problem. We
represent a partial solution as an array [s_1i, s_2j, s_3k, ...], where s_1i is a candidate solution
to sub-problem P_1, s_2j is a candidate solution to sub-problem P_2, and so on.
Segmenting a bigger problem into sub-problems makes the task of writing agents to
solve the sub-problems easier, as a smaller problem is being attacked. As well, if a
sub-problem is present in many problems, it may be possible to reuse an agent that
solves that sub-problem. This translates to savings in the design, implementation and
debugging efforts that would otherwise be needed for the development of a
completely new agent. The system can achieve some fault tolerance if the agents have
overlapping information [4]. The crash of an agent (or the computer it is running on)
will not prevent the whole system from finding a solution if the combined knowledge
of other agents covers the lost agent's knowledge of the problem. This is of special
interest in long-running programs. If needed, the system can provide security by
restricting sensitive knowledge to trusted agents.

Interaction is common in complex systems made of sub-problems. To make it more 
probable that a solution can be found, a set of partial solutions is kept. This set can be 
managed as a workpool for example. The set can be initialized by randomly 
generating partial solutions, some or all of which may be wrong. Having more than a 
single partial solution matches the existence of more than one agent, and makes a 
parallel search for a solution possible. Having a set of solutions to work on also 
means that we can end up with different final solutions, possibly of different qualities. 
For example, some may be elegant, but expensive to implement, and others may be
cruder, but cheaper. These solutions can be used for different purposes without the 
need to develop them separately. This enables a two level problem solving strategy. 
At the first level, "hard requirements" are met in the final solutions. At the second 
level, "soft requirements" are considered in the process of selecting one or more of the 
final solutions produced at the first level. 

The following describes how the system works. The problem is defined in terms of 
its sub-problems, and agents are assigned the job of solving these sub-problems. Each 
agent checks one partial solution at a time to see if it satisfies its assigned sub- 
problem(s). If not, it tries to create a new partial solution. The new partial solution 
may be invalid for another sub-problem, so other agents will have to look at it later
and if needed, find another solution for the sub-problem with the invalidated solution.
This is an iterative process, because agents can repeatedly invalidate the result of each 
other's work in unpredictable ways. Each agent is searching its own search space for 
solutions to its part of the problem, in the hope of finding a place that is compatible 
with all other parts. All the agents should work on a partial solution in order to make 
sure that all the sub-problems in that partial solution are solved. This decentralized 
form of activity by the agents makes it possible for them to run in parallel, either on a 
multiprocessor or on a distributed system. A partial solution is removed from the 
workpool when all the agents have looked at it and none needed to change it, 
signifying that it has turned into a total solution. They can signal this by setting flags, 
and the first agent that notices that all the flags are set can take the work out of the 
pool. Any change made by one agent results in all flags being reset. The algorithm 
can stop when one or more solutions have been found. To make sure the program 
terminates, it can count the number of times agents work on partial solutions, and stop 
when it reaches a value determined by the user. The best partial solution at that time 
can then be used. 
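
The flag bookkeeping described above can be sketched in Python as follows. This is only an illustration: the class name `PartialSolution` and its methods are assumptions made for this sketch and are not taken from the implemented system.

```python
class PartialSolution:
    """One work item in the pool, with one 'looked at it, no change needed' flag per agent."""

    def __init__(self, values, n_agents):
        self.values = list(values)
        self.flags = [False] * n_agents

    def visit(self, agent_id, new_values=None):
        """Record a visit; new_values is passed only when the agent changed something."""
        if new_values is not None and list(new_values) != self.values:
            # Any change made by one agent resets every flag.
            self.values = list(new_values)
            self.flags = [False] * len(self.flags)
        else:
            self.flags[agent_id] = True

    def is_total_solution(self):
        # True once all agents have looked at it and none needed to change it.
        return all(self.flags)


ps = PartialSolution([1, 3, 4], n_agents=2)
ps.visit(0)                      # agent 0 is satisfied
ps.visit(1, [1, 3, 5])           # agent 1 changes a value, so all flags are reset
after_change = ps.is_total_solution()
ps.visit(0)
ps.visit(1)
after_full_pass = ps.is_total_solution()
```

In this sketch the first agent that sees `is_total_solution()` return true would remove the item from the workpool.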

The system designer can use random methods to divide the problem, assign sub- 
problems to agents, create the initial partial solutions, and arrange for the agents to 
visit the partial solutions. Alternatively, he can choose to use deterministic methods in 
any or all of these activities. But even when everything in the system design is 
deterministic, the algorithm in general does not guarantee finding a solution. That is 
because predicting the effects of the interactions between the agents may not be 
possible by the designer. There is also no guarantee of progress towards a solution. 
However, if the agents operate asynchronously, it is less probable that a specific 
sequence of changes will be performed repeatedly, which in turn makes it less 
probable that the system enters a cycle. In this case, if a global solution exists, then 
given enough time the system is likely to find it. 

IMA is different from methods that use genetic operations. Genetic methods use 
random perturbations followed by a selection phase to move from one generation to 
the next. None of these concepts exists in IMA. The main element that introduces 
randomness into the picture is the interaction between the agents' actions. 



3 Segmenting a Constraint Satisfaction Problem 

This section demonstrates how the IMA method can be applied to a simple form of 
Constraint Satisfaction Problem (CSP). It is easy to automatically generate test CSPs, 
and they are decomposable into interacting parts. Since we want to apply the IMA 
method to exponential problems, we use the backtracking method to solve the 
generated CSPs. 



3.1 Constraint Satisfaction Problem 

In a CSP we have a set of variables $V = \{v_1, v_2, \ldots\}$, each of which can take on values 
from a set $D = \{d_1, d_2, \ldots\}$ of predefined domains, and a set $C = \{c_1, c_2, \ldots\}$ of 
constraints on the values of the variables. A constraint can involve one or more 
variables. Finding the solution requires assigning values to the variables in such a way 
that all the constraints are satisfied. Sometimes we need all such assignments, and at 
other times just one is enough. It may be desirable to have partial solutions, which are 
variable assignments that satisfy some, but not all of the constraints. 

Many problems can easily be formulated as CSPs, and then standard methods can 
be used to solve them. CSPs have been used in scheduling, time tabling, planning, 
resource allocation and graph coloring. In most cases finding a solution to a CSP 
requires domain-specific knowledge, but some general methods are applicable in 
many situations. The traditional way of solving a CSP is to assign a value to one 
variable and then see if the assignment violates any constraints involving this variable 
and other previously instantiated variables. If there is a violation, another value is 
chosen; otherwise we go to the next unassigned variable [3]. If we exhaust the domain 
of a variable without success, then we can backtrack to a previously instantiated 
variable and change its value. 
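
A minimal Python sketch of this traditional backtracking procedure is given below. The function name `backtrack`, the representation of constraints as (variables, predicate) pairs, and the small example problem are all assumptions chosen for illustration.

```python
def backtrack(domains, constraints, assignment=None):
    """Chronological backtracking.

    domains: dict mapping variable name -> list of candidate values.
    constraints: list of (vars, predicate) pairs; predicate receives the values
    of vars in order and returns True when the constraint holds.
    """
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(domains):
        return dict(assignment)
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        # Test only the constraints whose variables are all instantiated.
        ok = all(
            pred(*(assignment[v] for v in vs))
            for vs, pred in constraints
            if all(v in assignment for v in vs)
        )
        if ok:
            result = backtrack(domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]
    return None  # domain exhausted: the caller backtracks


domains = {"x0": [1, 2, 3], "x1": [1, 2, 3], "x2": [1, 2, 3]}
constraints = [
    (("x0", "x1"), lambda a, b: a + b < 5),
    (("x1", "x2"), lambda a, b: a != b),
    (("x0", "x2"), lambda a, b: a > b),
]
solution = backtrack(domains, constraints)
unsolvable = backtrack({"x": [1]}, [(("x",), lambda a: a > 1)])
```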



3.2 Related Work 

Several methods rely on multiple agents to solve a CSP. The Distributed Constraint 
Satisfaction Problem (DCSP) is defined in [6]. In DCSP the variables of a CSP are 
divided among different agents (one variable per agent). Constraints are distributed 
among them by associating with each agent only the constraints related to its variable. 
Unlike the classic backtracking algorithm, the asynchronous backtracking algorithm [6] 
allows the agents to run asynchronously and in parallel. 
Each agent instantiates its variable and communicates the value to the 
agents that need that value to test a constraint. They in turn evaluate it to see if it is 
consistent with their own value and other values they are aware of. Infinite processing 
loops are avoided by using a priority order among the agents. In Asynchronous weak- 
commitment search [6], each variable is given some initial value, and a consistent 
partial solution is constructed for a subset of the variables. This partial solution is 
extended by adding variables one by one until a complete solution is found. Unlike 
the asynchronous backtracking case, here a partial solution is abandoned after one 
failure. 

In [1], constraint agents with incomplete views of the problem 
cooperate to solve it. Agents assist each other by exchanging information 
obtained during preprocessing and as a result improve problem solving efficiency. 
Each agent is a constraint-based reasoner with a constraint engine, a representation of 
the CSP, and a coordination mechanism. This agent-oriented technique uses the 
exchange of partial information rather than the exchange or comparison of entire CSP 
representations. This approach is suited to situations where the agents are built 
incompatibly by different companies or where they have private data that should not 
be shared with others. 



3.3 Segmented Constraint Satisfaction Problem 

In the Segmented CSP (SCSP) each constraint is considered a sub-problem. A 
variable can be shared among more than one constraint, so there is interaction 
between sub-problems. The SCSP Solver (SCSPS) uses a workpool model of parallel 
processing [2]. A number of partial solutions of the problem are created by assigning 
random values to each variable. Multiple agents are then started to solve the CSP. 
Each of these agents is given some knowledge about the problem by assigning it a set 
of variables and a set of constraints. There is no need to worry about the 
interdependencies among the constraints assigned to different agents. Each agent 
determines the values that can be assigned to its assigned variables. Having a 
constraint entitles the agent to test its validity. Agents ignore other variables and 
constraints in the problem, as they do not need to know anything about other agents' 
knowledge of the problem. This greatly eases the designer's initial knowledge- 
segmenting job, which can even be done randomly. Any changes made later to an 
agent's knowledge of the problem will not necessarily result in a wave of changes in 
other agents. It also does not matter if the variables in a constraint are not assigned to 
the agent that has that constraint, as other agents that are assigned those variables will 
be responsible for changing their values. 

Dividing the problem among the agents means that each one of them has to search 
a smaller space to solve its portion of the problem, and so it can run faster, or run on 
slower hardware. If there are v variables and c constraints in the problem, each agent j 
will have to deal with $v_j < v$ variables and $c_j < c$ constraints. Agents can use any 
method in solving the portion of the problem assigned to them. Each agent may have 
to do a complete search of its own assigned space more than once. This is because a 
given state in agent j's search space may not be consistent with the current state of 
another agent k, while that same state may work when agent k goes to another location 
in its search space. 

Each agent gets a partial solution from the pool, tries to make it more consistent, 
and then returns it to the pool. The partial solution is accessible to only one agent 
while it is out of the workpool. An agent never interacts with others directly because 
all communication is done through the workpool. This simplifies both the design and 
the implementation by reducing the amount of synchronization activity. If there is 
enough work in the pool, all the agents will be busy, ensuring that they can all run in 
parallel. In a multi-processor or a multi-computer [5], this could result in execution 
speed-up. Agents can disturb other agents' work by changing the value of a common 
variable, or by signaling the need for a change in a variable. This means that a partial 
solution must be repeatedly visited by the agents. 



3.4 The Demonstration Program 

The program SCSPS.java implements SCSPS. It is written in Java and was developed 
using Sun's Java Development Kit version 1.2. The test computer was a 120 MHz 
Pentium with 96 MB of RAM, running Windows 95. SCSPS creates a problem by 
generating some variables and a set of constraints on them. The constraints are of the 
form $x + y < a$, where $x \in \{x_1, \ldots, x_n\}$ and $y \in \{y_1, \ldots, y_m\}$ are positive integers within a 
specified range, and a is a constant. It is possible to create harder or easier problems 
by changing the domain or constraint limits. Figure 1 shows one possible set of 
variables and the constraints on them. 

Variables: X0, X1, X2, X3, X4 

Constraints: X3 + X2 < 10, X0 + X2 < 7, X0 + X1 < 6, X1 + X1 < 12, X4 + X3 < 5 



Fig. 1. The variables and constraints of an example CSP. 

One practical use for this type of constraint is in job-shop scheduling, where each 
job j has a start time $start_j$, which is a variable, and a duration $duration_j$, which is a 
constant. The problem can be defined as the requirement that for each pair of jobs j 
and k exactly one of $start_j + duration_j < start_k$ or $start_k + duration_k < start_j$ 
holds. An expert decides which one of these two constraints is to be present in the 
final set of constraints. The aim is to find suitable values for all start variables so that 
all the constraints are satisfied. 

After the random problem is generated, the program creates a workpool and fills it 
with random partial solutions, which are probably inconsistent. Agents are then 
created as independent Java threads and each is randomly assigned some variables 
and constraints. It is possible for some variable(s) and constraint(s) to be assigned to 
more than one agent. Having a variable enables an agent to change its value. Having a 
constraint enables an agent to test it. An agent with an assigned constraint should have 
read access to the variables of that constraint. The agents run in a cycle of getting a 
partial solution from the pool, working on it, and putting it back. After completing a 
cycle, the agents wait for a small, randomly determined amount of time before going 
on to the next iteration, thus making sure that there is no fixed order in which the 
partial solutions are visited. The randomness present in the design means that the 
solutions will differ from one run to the other, even when working on the same 
problem. The workpool counts the number of times it has given partial solutions to 
the agents, and terminates the program as soon as it reaches a predetermined value. In 
general it is possible to let the agents run indefinitely, or until all the partial solutions 
are consistent. The main thread of the program runs independently of the agents and 
can automatically check all the partial solutions in the workpool and print the 
inconsistencies. Figure 2 shows two of the agents working on the example CSP. 



Agent 1 
Owned variables: X0, X2, X3 
Constraints: X3 + X2 < 10, X0 + X1 < 6, X1 + X1 < 12 

Agent 2 
Owned variables: X0, X1 
Constraints: X0 + X2 < 7 

Fig. 2. Two agents working on the example CSP. 
Figure 3 shows the workpool for the example CSP and some of the partial solutions it 
contains. The partial solutions change over time. 



Partial Solution 1: [1,3,4,2,2] 
Partial Solution 2: [7,2,6,1,3] 
Partial Solution 3: [3,7,3,4,4] 



Fig. 3. Workpool of partial solutions for the example CSP. 

A partial solution for n variables can be considered a vector representing a point in 
an n dimensional space. Each agent has to deal with only m < n variables. There is a 
"change" flag associated with each variable, which is used by agents to signal the 
need for that variable's value to change because its current value violates a constraint. 
This need is detected by an agent that has the violated constraint, but the actual 
change should be done by one of the agents that is assigned the variable. 

Each agent starts the processing by getting a partial solution from the pool and then 
changing the value of all the variables that it is assigned and which have their 
"change" flag set. These variables should change because, as detected by other agents, 
their values violate some constraints. A partial solution will thus move from its 
currently unsuitable point. This is done in the hope of finding another point that 
satisfies more constraints. Agents then try to ensure that the variables they are 
assigned satisfy the constraints they are aware of. Each constraint x + y < a can be 
one of three types, and what the agent does depends on this type. 

1. If both x and y are assigned to the agent, a slightly modified version of 
backtracking is used to find suitable values for x and y. The modifications have to 
do with the fact that here finding a solution is a multi-pass process. For instance, 
the variables are changed from their current values, as opposed to a fixed starting 
point, thus making sure that the whole domain is searched. 

2. If only one of x and y is assigned to the agent, only the value of the variable that 
is assigned to the agent is changed, and the other variable is considered a 
constant. 

3. If neither x nor y is assigned to the agent, then neither of them is changed. 

Variables are changed by incrementing their values, with a possible wrap-around to 
keep them within the specified domains. This is to make solving the problem harder, 
as a trivial solution for this type of constraints is to simply use the smallest values of 
the variables. After this phase, each agent inspects its constraints of the second and 
third types. If it finds an inconsistency, it sets the "change" flag of the unassigned 
variable(s). This is done because this agent has done all it can, and now is signaling 
the failure to others. Another agent that is assigned these inconsistent variables will 
get this work later and change the values. The agents continue like this until the 
workpool's counter for the number of checked-out partial solutions reaches the limit. 
At this point no more partial solutions are given out and the agents stop executing. 
The main processing loop in each agent is shown in Figure 4. 

1. Get a partial solution from the pool 

2. Alter my variables that have their “change” flag set by other agents. 

3. Use my constraints to find a consistent value for my variables. 

4. Check the constraints with one or two missing variables; set their “change” flag if an inconsistency is found. 

5. Give the partial solution back to the pool. 



Fig. 4. The algorithm followed by each agent in SCSPS.java. 
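
The loop of Fig. 4 can be sketched in Python for constraints of the form x + y < a. This is a simplified, single-threaded illustration and not the code of SCSPS.java: the function `agent_step`, the round-robin driver, the odometer-style search that stands in for the modified backtracking, and the three-variable example are all assumptions made for this sketch.

```python
def agent_step(solution, change, owned, constraints, domain_max):
    """Steps 2 to 4 of the agent loop on a checked-out partial solution.

    solution: list of variable values (modified in place).
    change: list of booleans, the per-variable "change" flags.
    owned: set of variable indices this agent may modify.
    constraints: list of (x, y, a) triples meaning solution[x] + solution[y] < a.
    domain_max: values lie in 1..domain_max; increments wrap around.
    """
    def bump(i):
        solution[i] = solution[i] % domain_max + 1

    # Step 2: alter owned variables whose "change" flag was set by other agents.
    for i in sorted(owned):
        if change[i]:
            bump(i)
            change[i] = False

    # Step 3: repair violated constraints by incrementing owned variables only.
    for x, y, a in constraints:
        movable = sorted({i for i in (x, y) if i in owned})
        for _ in range(domain_max ** len(movable)):
            if solution[x] + solution[y] < a:
                break
            for i in movable:          # odometer-style increment with carry
                bump(i)
                if solution[i] != 1:   # no wrap-around, so stop carrying
                    break

    # Step 4: signal remaining violations to the agents that own the variables.
    for x, y, a in constraints:
        if solution[x] + solution[y] >= a:
            for i in (x, y):
                if i not in owned:
                    change[i] = True


solution = [3, 3, 4]                 # one partial solution taken from the pool
change = [False] * 3
agents = [
    ({0}, [(0, 1, 4)]),              # agent A owns variable 0
    ({1, 2}, [(1, 2, 5)]),           # agent B owns variables 1 and 2
]
for _ in range(4):                   # fixed number of rounds (steps 1 and 5 are implicit)
    for owned, cons in agents:
        agent_step(solution, change, owned, cons, domain_max=5)

all_constraints = [(0, 1, 4), (1, 2, 5)]
consistent = all(solution[x] + solution[y] < a for x, y, a in all_constraints)
```

In the first round agent A cannot satisfy its constraint alone and flags variable 1; agent B then moves that variable, which lets agent A succeed in the second round.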



Table 1 contains the results of several runs of the program to solve six problems 
using 1, 5, 10 and 20 agents. 30 runs were attempted for each row. There were 21 
partial solutions in the workpool, and they were created randomly for each run. The 
values under the columns Variables, Constraints, and Agents, represent the 
corresponding values in the generated problem. No checks were done to see if a 
partial solution has been turned into a total solution, and so the value of Rounds 
determined the number of times the workpool manager gave out partial solutions to 
the agents before the program stopped. This value is 1 when there is a single agent in 
the system, because in this case a standard backtracking method was employed, and a 
single solution was sought (multi-agent runs usually find many more solutions). Runs 
that did not finish within the time limit of 120 seconds were aborted. The number of 
aborted runs is indicated under the Timed out column. The 4 rows within each 
outlined box correspond to the same problem. 



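Vars | Constraints | Agents | Rounds | Timed out |          Run Time (ms) 
     |             |        |        |           |   Avg |  Best | Worst | Std Dev 
-----+-------------+--------+--------+-----------+-------+-------+-------+-------- 
     |          10 |      1 |      1 |         0 |   130 |   100 |   390 |      57 
     |          10 |      5 |  15000 |         0 | 18006 | 17520 | 24600 |    1428 
     |          10 |     10 |  15000 |         0 |  8958 |  8840 |  9070 |      66 
     |          10 |     20 |  15000 |         0 |  6148 |  4720 | 14830 |    2858 
-----+-------------+--------+--------+-----------+-------+-------+-------+-------- 
     |          20 |      1 |      1 |         0 |   186 |    50 |   930 |     198 
     |          20 |      5 |  15000 |         0 | 21765 | 18070 | 33780 |    4924 
     |          20 |     10 |  15000 |         0 |  9588 |  9060 | 10820 |     388 
     |          20 |     20 |  15000 |         0 |  5447 |  4890 |  5830 |     211 
-----+-------------+--------+--------+-----------+-------+-------+-------+-------- 
     |          20 |      1 |      1 |         8 |     - |    50 |     - |         
     |          20 |      5 |  15000 |         0 | 17712 | 17570 | 18290 |     132 
     |          20 |     10 |  15000 |         0 |  9522 |  8890 | 22960 |    2569 
     |          20 |     20 |  15000 |         0 |  4899 |  4660 |  6310 |     335 
-----+-------------+--------+--------+-----------+-------+-------+-------+-------- 
     |          40 |      1 |      1 |         4 |     - |    60 |     - |         
     |          40 |      5 |  15000 |         1 |     - | 17680 |     - |         
     |          40 |     10 |  15000 |         0 |  9704 |  8840 | 19230 |    2373 
     |          40 |     20 |  15000 |         0 |  4855 |  4720 |  5930 |     214 
-----+-------------+--------+--------+-----------+-------+-------+-------+-------- 
  50 |          50 |      1 |      1 |        27 |     - |   110 |     - |         
     |          50 |      5 |  15000 |         0 | 23478 | 17680 | 62940 |   12731 
     |          50 |     10 |  15000 |         0 |  8998 |  8900 |  9192 |      51 
     |          50 |     20 |  15000 |         0 |  6383 |  4670 | 52240 |    8661 
-----+-------------+--------+--------+-----------+-------+-------+-------+-------- 
  50 |         100 |      1 |      1 |        30 |     - |       |     - |         
     |         100 |      5 |  15000 |         3 |     - | 18070 |     - |         
     |         100 |     10 |  15000 |         2 |     - |  9060 |     - |         
     |         100 |     20 |  15000 |         0 |  5801 |  4840 | 13780 |    2497 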

Table 1. The results of several runs of the program. 



As expected, a single-agent backtracking method's success rate deteriorates 
quickly, until it stops finding any solutions within the time limit. The IMA method, 
on the other hand, shows very good scalability and performs like a constant-time 
problem solver for the range of problems in this table. One could speed up the process 
of finding answers at the top area of the table by decreasing the value of Rounds. For 
example, 2000 rounds would be quite enough for the first 8 rows. 

An interesting observation is that having more agents helps the system find 
solutions faster. This is to be expected, even though on a single-processor computer 
introducing m agents to solve the problem adds an overhead of $O(m)$. The reason is 
that, assuming a non-overlapping segmentation of the problem into sub-problems $P_i$, 
each with a search space of size $a_i$, the whole search space is of size $\prod_i a_i$. This 
includes $O(a^n)$ or $O(n!)$ problems as special cases. While each of the $a_i$ values may 
represent a small and manageable space, $\prod_i a_i$ can represent a huge search space. We 
have thus converted the problem of searching a huge space to that of searching many 
smaller ones, even though the nature of the problem is unchanged. As is the case in 
combinatorial problems, reducing n by a little can dramatically speed up the process 
of finding a solution to that problem. This compensates for the effects of having more 
than one agent. Table 2 confirms this, and shows that increasing the number of agents 
beyond a point will eventually have a rather small, negative effect on execution time. 
Here 101 partial solutions were in the workpool, and 30 runs were performed for 
each row. 
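
The size argument can be made concrete with a small Python calculation. The helper `search_space_sizes` and the figures used (five sub-problems of ten states each) are hypothetical, and the segmented figure counts a single sweep per agent, whereas in practice an agent may have to sweep its space more than once.

```python
from math import prod


def search_space_sizes(sub_sizes):
    """Return (joint, segmented) sizes for non-overlapping sub-problems."""
    joint = prod(sub_sizes)       # one solver exploring every combination
    segmented = sum(sub_sizes)    # one full sweep by each agent of its own space
    return joint, segmented


joint, segmented = search_space_sizes([10] * 5)
# joint is 100000 states, segmented is 50 states per round of sweeps
```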



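Vars | Constraints | Agents | Rounds | Timed out |          Run Time (ms) 
     |             |        |        |           |   Avg |  Best | Worst | Std Dev 
-----+-------------+--------+--------+-----------+-------+-------+-------+-------- 
     |          30 |      5 |  20000 |         0 | 26771 | 23390 | 63720 |    9875 
     |          30 |     10 |  20000 |         0 | 11958 | 11860 | 12080 |      56 
     |          30 |     20 |  20000 |         0 |  6678 |  6260 | 15380 |    1647 
     |          30 |     40 |  20000 |         0 |  7301 |  5110 | 10110 |    1757 
     |          30 |     60 |  20000 |         0 |  8022 |  5380 | 10050 |    1626 
     |          30 |     80 |  20000 |         0 |  8255 |  5820 | 10550 |    1732 
     |          30 |    100 |  20000 |         0 |  8651 |  5930 | 10820 |    1556 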

Table 2. The effects of increasing the number of agents on the execution time. 



One observation from the results of the runs with a high number of agents was that 
because of the value of Rounds, in many cases a partial solution that was invalid at the 
start, and was not visited by all the agents, turned into a total solution. One reason for 
this is that these randomly generated partial solutions had some of their sub-problems 
already solved. The other reason is that due to the high number of agents and the 
resulting redundancy in the system, other agents were able to move the partial 
solutions towards total solutions. Either way this hints at the ability of the system to 
tolerate faults. More fault tolerance can be achieved by using more than one 
workpool, preferably in different computers, so the crash of one machine will not 
destroy all the partial solutions. 



4 Conclusion 

We proposed the Iterative Multi-Agent (IMA) problem solving method that involves 
the following activities: dividing the problem into sub-problems, dividing the 
knowledge to solve the sub-problems among multiple agents, and having a set of 
partial solutions. Then an iterative process starts, in which agents make changes to the 
solution of each sub-problem if necessary. The design of the system mimics the 
behavior of human experts collaborating to solve a problem. But here more than one 
possible solution is considered. Each expert looks at the problem from his 
perspective, and tries to solve his sub-problem by using the resources made available 
to him. Conflicts may arise, and then each expert has to modify his part of the plan in 
the hope of making it compatible with the other parts. This seems to be an intuitive 
approach that works in practice. The two most important advantages of following this 
guideline are a more natural mapping process for problems that can easily be 
expressed in terms of interacting sub-problems, and a reduction of the search space 
size. One disadvantage of the IMA method is that in general there is no guarantee that 
a solution will be found. Another disadvantage is that it may not be possible to find a 
globally optimum solution when there is no authority with knowledge about the 
whole problem. This could be tolerable in hard problems where finding a solution at 
all is good enough. 

This paper gave a practical example of co-operating agents that run in parallel and 
take part in solving a CSP. The source code of the implemented system (SCSPS.java) 
can be obtained freely by anonymous FTP from orion.cs.uregina.ca/pub/scsps or by 
contacting the author. 

In the future we intend to apply the IMA method to other problem areas. Studying the 
effects of changing and inconsistent knowledge in the agents is also of interest to us. 
Making the system utilize a multi-processor machine or run over a network are other 
worthwhile efforts, as they enable the tackling of even bigger problems. 
Abstract. A commonly used relational learning system (FOIL) is extended 
through the use of cliches, which are known to address FOIL's greedy search 
deficiencies. The issue of finding good biases in the form of cliches is 
addressed by learning the cliches. This paper shows empirically that such biases 
can be learned in one domain and applied in another, and that significant 
improvement in accuracy can be achieved in this setting. The approach is 
applied to a real-life problem of learning finite element method structures from 
examples. 



1 Introduction 

Inductive learners that use a first-order language to express examples, background 
knowledge and hypotheses (or concept descriptions) are called inductive relational 
learners. Because they induce hypotheses in the form of logic programs they are also 
called inductive logic programming (ILP) systems. 

Given a set of positive and negative examples of a concept and possibly some 
background knowledge, ILP systems find a logic program such that every positive 
example is “covered” (subsumed) by the program and no negative examples are 
covered by it. The training examples and the background knowledge are also 
expressed as logic programs, with additional restrictions imposed on each one. For 
instance, training examples are typically represented as ground facts of the target 
predicate, and most often background knowledge is restricted to the same form. 

Inductive relational learners search for the hypothesis either in a bottom-up or in a 
top-down manner. Bottom-up approaches search from specific examples to a general 
hypothesis. They start from training examples and search the hypothesis space using 
generalization operators. GOLEM [12] uses a bottom-up approach. Top-down 
approaches search from the most general hypothesis to a specific hypothesis using 
specialization operators. This kind of search can easily be guided by a heuristic. 
Learning systems using a top-down approach include FOIL [15], LINUS [6] and 
FOCL [13]. 

Top-down ILP learners such as FOIL and FOCL learn Horn clauses one literal at a 
time until no negative examples are covered. Each clause is generated by adding one 
literal at a time using a greedy search algorithm. At each step the coverage of the rule 
after adding a literal is tested on training examples. The literal that best discriminates 
the remaining positive and negative examples is added to the current clause. The 
clause is complete when no negative examples are covered by the clause. These 
systems suffer from myopia. This arises when the best discrimination would be 
obtained by adding more than one literal at once. Consider the concept of cup that is 
something that has a handle, a flat bottom, and a concavity pointing upward. 

cup(X) :- partof(X, Y), handle(Y), 
          partof(X, Z), bottom(Z), flat(Z), 
          partof(X, W), upward_concavity(W). 



Based on a greedy search for a single literal, the relation partof(X, Y) by itself 
is not likely to distinguish cups from non-cups (both cups and non-cups have parts) 
and hence would not be added by such an algorithm to the definition of a cup. On the 
other hand, the conjunction of literals partof(X, Y), handle(Y) is likely to 
distinguish some cups from non-cups and should be added to the definition. The 
problem becomes the search for combinations of literals rather than just single literals. 
Unfortunately, trying all possible combinations of literals can be intractable. A 
mechanism to search efficiently through the space of combinations of literals is 
needed. A learner can be provided with such a mechanism in the form of a 
special-purpose bias. 
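
The myopia effect can be illustrated with a toy Python calculation. The facts below are invented for this sketch, and coverage is simply counted per example, which is a simplification of FOIL's tuple-based heuristic rather than a reproduction of it.

```python
# Hypothetical toy facts; object and part names are made up for illustration.
partof = {("cup1", "h1"), ("cup1", "b1"), ("cup2", "h2"),
          ("bowl1", "b2"), ("plate1", "b3")}
handle = {"h1", "h2"}
positives = {"cup1", "cup2"}
negatives = {"bowl1", "plate1"}


def covered_by_partof(obj):
    # Body: partof(X, Y)
    return any(x == obj for x, _ in partof)


def covered_by_partof_handle(obj):
    # Body: partof(X, Y), handle(Y)
    return any(x == obj and y in handle for x, y in partof)


def coverage(test):
    """Return (positives covered, negatives covered) for a clause body."""
    return (sum(test(o) for o in positives), sum(test(o) for o in negatives))


single = coverage(covered_by_partof)         # (2, 2): no discrimination
pair = coverage(covered_by_partof_handle)    # (2, 0): separates cups from non-cups
```

The single literal covers every example and so offers a greedy learner no reason to add it, while the two-literal conjunction excludes all the negatives.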

We propose to learn combinations of literals automatically as a particular type of 
bias. These combinations of literals are called relational cliches. Unlike Silverstein’s 
relational cliches [16], our cliches implicitly represent both the pattern of predicates 
and variables and restrictions on them. The underlying ideaAsJoJearn cliches from 
examples of a concept and to use them across domains ure_J|). To learn such 
cliches, we have developed and implemented the learning system called CLUSE 
{Cliches Learned and Used) [9]. Examples are generalized in a bottom-up manner 
using CLGG (Contextual Least General Generalization) [11]. Resulting 
generalizations are expressed with the literals (i.e. predicate symbols and their 
arguments) specific to the domain. These generalizations are further generalized into 
domain-independent cliches, where predicate symbols are turned into variables that 
represent patterns of literals with implicit restrictions on their predicates and 
ar guments and can be used across domains. Examples of relational cliches are shown 
in 



I [Table 1 







Figure 1. Relational cliches used as a transfer of knowledge across domains. 



When relational cliches are transferred across domains they are associated with a 
list of instantiations. An instantiation is made with all combinations (without 
repetitions) of its literals’ instantiations (one instantiation for each literal at a time). 
Literals in cliches are instantiated with literals in the domain of the target concept (i.e. 
predicates with the same arity that are used in the examples). Examples of 
instantiations are shown later. 
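
One possible reading of this instantiation step is sketched below in Python. The function `instantiate`, the predicate table, and the interpretation of "without repetitions" as "no predicate symbol used twice within one instantiation" are assumptions made for illustration; the cliche used is number 2 of Table 1.

```python
from itertools import product


def instantiate(cliche, predicates):
    """cliche: list of (predicate_variable, args); predicates: dict name -> arity."""
    # Candidate predicate symbols for each literal: those with matching arity.
    choices = [[p for p, n in predicates.items() if n == len(args)]
               for _, args in cliche]
    result = []
    for combo in product(*choices):
        if len(set(combo)) == len(combo):      # skip combinations that repeat a predicate
            result.append([f"{p}({', '.join(args)})"
                           for p, (_, args) in zip(combo, cliche)])
    return result


cliche = [("P1", ("X", "Y")), ("P2", ("Y",)), ("P3", ("Y",))]
predicates = {"partof": 2, "handle": 1, "flat": 1}   # hypothetical target domain
instantiations = instantiate(cliche, predicates)
```

With these inputs the sketch yields two instantiations, which contain the same literals in a different order.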

To show that the transfer of knowledge addresses the myopia problem of a learner, 
we extended FOIL [15] to learn concepts with cliches or without cliches (called 
xFOIL-CLICHES). We use cross-validation to compare the hypothesis accuracy of 
concepts learned with and without cliches. Cliches are tested on two domains of 
application. One domain is the real-life application of the finite element methods 
(FEM). The other domain is the synthetic domain of blocks, which offers a wide 
variety of problems (or concepts). Instances in that domain are automatically 
generated using a domain-independent examples generator called GenEx. 



Table 1. Examples of relational cliches learned in the blocks domain, where P# are variable 
predicates (and P2 ≠ P3). 

#   Relational cliche 
1.  P1(X, Y), P2(Y) 
2.  P1(X, Y), P2(Y), P3(Y) 
3.  P1(X, Y), P2(X), P3(Y) 
4.  P1(X, Y), P2(X), P3(Y), P4(Y) 



2 The FEM and the Blocks Domains 

Concepts in the blocks domain represent geometric problems and are similar to the 
Bongard problems used in simple pattern recognition problems [2] and classification 
tasks, often used in standard intelligence quotient (IQ) tests. Concepts in that domain 
offer a wide variety of problems that can be automatically generated using GenEx 
[10]. 

FEM is a numerical method to analyze stresses and deformations in physical 
structures (or models). A model is described as a collection of edges (or meshes) 
(Figure 2). An edge is characterized by the number of finite elements (FEs) on it. The 
basic problem during manual FEM design is the selection of the number of FEs, given 
an edge. This number is a property of the edge as well as of the structure of the model 
(other edges). Determining the number of FEs, therefore, is a relational problem. 
Considerable expertise is required in determining the appropriate number of FEs since 
edges are affected by several factors, including the shape of the model, loads and 
supports. So, to design a new model, it would be useful to have design rules that 
determine the number of FEs on its edges. Machine learning can be used to learn 
these design rules. A rule can be learned from edges with the same number of FEs 
occurring in known models. Rules could then be saved in a knowledge base and used 
in a FEM design expert system to determine the number of FEs on edges of a new 
unseen model. Results from different experiments with FEM models [7, 5] were 
encouraging enough to believe that the derivation of a knowledge base using 
automatic learning is a feasible approach for this domain. On the other hand it was 
observed that some kind of look ahead is needed; the relation neighbour is not 






Johanne Morin and Stan Matwin 



induced in clauses, which means that essential information is not taken into account 
[7]. The relation neighbour is important because it expresses the relational character 
of the problem (i.e. the number of FEs of a given edge, whatever FEM we are learning, 
may depend on the relationship of this edge and other edges). However, it is unlikely that a 
literal like neighbour would be added by a learner like FOIL since it occurs in 
positive and negative examples. 



2.1 The Learning Problem 

Examples are derived from ten models. Each model is described as a collection of 
edges. Each edge results in a corresponding positive example stating the number of 
FEs on it. Examples are described with the relation mesh(E, N), where N is the 
recommended number of FEs along edge E. The FEM design problem can be stated as 
follows. Positive and negative examples are given as input. Positive examples are 
edges with the same number of FEs. Negative examples are positive examples of all 
other concepts. For instance, mesh(E, 1) are the positive examples for the concept 
of edges that need 1 FE, and mesh(E, 2), mesh(E, 3), ..., mesh(E, 8) are the 
negative examples for that concept. Each edge of a FEM model has several attributes 
that influence the resolution of a FE and which are part of the background knowledge. 
The output is a design rule that describes edges with a set of disjunctive (Prolog) 
rules covering as many positive examples and as few negatives as possible. For 
instance: 

mesh(X, 1) :- usual(X), free(X), opposite(X, Z), 
              usual(Z), short(Z). 

where X is the edge that needs 1 FE (which is given in the starting clause), while Z is 
some other edge in the model. Once the rules are learned, the goal mesh(E, N) is called 
with E bound and N unbound. The number of FEs on the given edge is assigned to N using 
the learned design rules. 
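
As an illustration only, the way such a rule is applied to a bound edge can be sketched in Python rather than Prolog. The facts below (edges e1 and e2 and their properties) are hypothetical and do not come from the FEM data set; only the rule body mirrors the example above.

```python
# Hypothetical background facts for a toy model with two edges.
FACTS = {
    "usual": {("e1",), ("e2",)},
    "free": {("e1",)},
    "short": {("e2",)},
    "opposite": {("e1", "e2")},
}


def holds(pred, *args):
    """True if the ground literal pred(args) is among the background facts."""
    return tuple(args) in FACTS.get(pred, set())


def edges():
    """All edge names mentioned anywhere in the facts."""
    return sorted({a for tuples in FACTS.values() for t in tuples for a in t})


def mesh_1(x):
    """mesh(X, 1) :- usual(X), free(X), opposite(X, Z), usual(Z), short(Z)."""
    return (holds("usual", x) and holds("free", x)
            and any(holds("opposite", x, z) and holds("usual", z)
                    and holds("short", z) for z in edges()))


# One entry per learned concept; only the concept "1 FE" is sketched here.
RULES = {1: mesh_1}


def mesh(edge):
    """Return every N for which a rule mesh(edge, N) succeeds."""
    return [n for n, rule in sorted(RULES.items()) if rule(edge)]
```

Calling mesh("e1") plays the role of the Prolog goal mesh(e1, N): the edge is bound and the number of FEs is returned by whichever rule succeeds.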




Figure 2. Labelled edges of a FE structure. 
3 xFOIL-CLICHES Learns with or without Cliches 



3.1 An Overview of FOIL 

The FOIL learning system [15] has been widely used by machine learning researchers. 
FOIL extends some ideas from attribute-value learning algorithms to the ILP 
paradigm. In particular, it uses a covering approach similar to AQ’s [8] and an 
information-based search heuristic similar to ID3’s [14]. For complete examples of 
applying FOIL, see [7, 1]. 

As explained earlier, FOIL learns constant-free and function-free program clauses, 
which means that no constants or terms other than variables may appear in the 
induced clauses. Function-free ground facts (relational tuples) are used to represent 
both the training examples and the background knowledge. FOIL learns a program 
one clause at a time until it covers all positive examples. Each clause is generated by 
adding one literal at a time. At each step the coverage of the rule after adding a literal 
is tested on training examples. The literal that best discriminates the remaining 
positive and negative examples is added to the current clause. The discrimination of 
examples is evaluated with an information gain measure. The clause is complete when no 
negative examples are covered by the clause. If FOIL exhausts all choices for a 
literal and the clause still covers negative examples, then FOIL fails and stops. 
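
The gain computation can be sketched as follows. This is a simplified form of the usual FOIL gain, under the assumption that each covered example is counted once (FOIL proper counts variable bindings); the function name and the example counts are illustrative only.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """Simplified FOIL information gain of adding one literal to a clause.

    p0, n0: positive/negative examples covered before adding the literal.
    p1, n1: positive/negative examples still covered afterwards.
    Assumes p0 > 0; returns 0.0 when no positive example remains covered.
    """
    if p1 == 0:
        return 0.0
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return p1 * (info_before - info_after)


# A literal that holds for every positive and every negative example leaves
# the coverage unchanged, so its gain is zero.
no_discrimination = foil_gain(36, 253, 36, 253)

# A literal that removes many negatives and few positives has positive gain.
useful_literal = foil_gain(36, 253, 30, 50)
```

The first call shows why a greedy learner does not select a literal that is true of all remaining positive and negative examples: its gain is exactly zero, however useful the new variable it introduces might be later.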



3.2 xFOIL-CLICHES Searches for Combinations of Literals 

xFOIL is our Prolog implementation of FOIL, extended to learn with cliches (in this 
case we refer to it as xFOIL-CLICHES). xFOIL-CLICHES searches for cliches only 
when FOIL fails to learn a clause, i.e. when no single literal has positive information 
gain, yet the clause still covers negative examples. xFOIL-CLICHES adds a cliche 
when the cliche provides a pattern of variables that, when instantiated to literals in 
the target domain, has a gain for the concept. 

xFOIL-CLICHES searches for variabilized instantiations of cliches in the same way 
that xFOIL searches for variabilized literals to add to a clause. After a few 
experiments, particularly in the FEM domain, restrictions on the search space for 
cliches were added to xFOIL-CLICHES. These restrictions prune the search space 
when the number of instantiations would result in too many variabilizations (i.e. 
choices of variables for a predicate). For instance, if the number of cliches is less than 
or equal to ten, xFOIL-CLICHES allows a maximum of approximately five hundred 
instantiations. Otherwise the maximum is a hundred. Restricting instantiations means 
that only the first instantiations of literals are preserved, according to the order in 
which literals are given to the learner. These restrictions limit the CPU time, and have 
very little effect on learned rules. Full descriptions of these restrictions can be found 
in [9]. 
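
A minimal sketch of instantiating a cliche and capping the number of instantiations is given below. The pattern P1(X, Y), P2(Y), P3(Y) is one of the cliches used in our experiments; the predicate list, the function names and the exact caps of 500 and 100 are assumptions made for illustration (the text above says "approximately"), and this is not the xFOIL-CLICHES code itself.

```python
from itertools import islice, product

# A cliche: predicate variables with their argument variables.
CLICHE = [("P1", ("X", "Y")), ("P2", ("Y",)), ("P3", ("Y",))]

# Hypothetical target-domain predicates (name, arity), in the order in
# which they are given to the learner.
PREDICATES = [("neighbour", 2), ("opposite", 2),
              ("free", 1), ("short", 1), ("long", 1)]


def instantiations(cliche, predicates):
    """Yield each replacement of predicate variables by same-arity predicates."""
    choices = [[name for name, arity in predicates if arity == len(args)]
               for _, args in cliche]
    for names in product(*choices):
        yield [(name, args) for name, (_, args) in zip(names, cliche)]


def max_instantiations(num_cliches):
    """Assumed caps: 500 when there are at most ten cliches, else 100."""
    return 500 if num_cliches <= 10 else 100


def capped_instantiations(cliche, predicates, num_cliches):
    """Keep only the first instantiations, in the order literals were given."""
    limit = max_instantiations(num_cliches)
    return list(islice(instantiations(cliche, predicates), limit))
```

Because islice stops the generator early, the instantiations beyond the cap are never built, which is the source of the CPU saving; the cost is that literals listed late by the user may never be tried.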







4 Experimental Setting 

The experiment is set up as follows. Independently, CLUSE learns relational cliches 
from examples and xFOIL learns hypotheses with or without cliches* (Figure 3). 
xFOIL uses a training set to learn the hypothesis, and a separate testing set to evaluate 
the accuracy (or error)^ of this hypothesis over subsequent examples. To evaluate the 
effect of the transfer of knowledge (in the form of cliches) between domains, a 3-fold 
cross validation is used to compare a concept learned with xFOIL (without cliches) to 
the same concept learned with xFOIL-CLICHES. The two learning algorithms are 
compared on the accuracy and the compactness of the learned hypothesis, and on their 
cost of application. The cost of an algorithm measures the number of search steps to 
learn a concept. For simplicity the cost is shown as the CPU time. The compactness of 
a hypothesis is the number of rules per hypothesis, the number of literals and the 
number of variables per rule.^ 




Figure 3. Independently from each other, CLUSE learns relational cliches and xFOIL learns 
hypotheses with or without cliches. xFOIL uses a training set to learn the hypothesis, and a 
separate testing set to evaluate the accuracy of the learned hypothesis. 



* One could imagine that cliches are preserved in a knowledge base and used with xFOIL 
whenever they are needed. 

^ The accuracy is equal to 1 - error. Both measures are used in this paper. 

^ Either of these measures or a sum of all of them could be used in a particular evaluation. 
To show that the transfer of knowledge across domains facilitates the learning of 
FEM concepts, we first apply xFOIL (without transfer of knowledge) to learn FEM 
concepts. Second, we show learned rules for one of the FEM concepts with the 
transfer of knowledge. Relational cliches from the blocks domain (shown in Figure 1) 
are used for this experiment. We then compare the results of xFOIL with those of 
xFOIL-CLICHES on learning FEM concepts. We also discuss the results of transferring 
knowledge to learn concepts in the blocks domain for which xFOIL is unable to learn 
a single literal to discriminate positive and negative examples. 



5 Results and Discussion 



According to the standard ILP practice for the FEM domain, our experiment consisted 
of learning eight design rules (or definitions). One rule is learned for each FEM 
concept, where a concept corresponds to all edges with the same number of FEs on 
them. On average FEM concepts are learned on 36 positive examples and 253 
negatives, and tested on 18 positive examples and 126 negatives (not shown). xFOIL 
performs very well (with only 8% error on testing sets) for two of the eight FEM 
concepts (i.e. FEM-7 and FEM-8).^ This means that for these two tasks xFOIL-CLICHES 
would need almost perfect performance to achieve a significant improvement 
(at most 1% of error at 0.05 significance). Therefore, these two concepts are removed 
for the next part of the experiment. For the other concepts, xFOIL is cheap and 
compact. On average xFOIL learns concepts in 4 minutes of CPU time. xFOIL learns 
13 rules of 3 literals describing 2 variables (where one variable represents an edge and 
the other one represents the number of FEs on edges). On the testing sets, learned rules 
make 26% error (covering 82% of the positive examples and 29% of the negative 
examples). As expected, the error is high because xFOIL has difficulty using relations 
opposite and neighbour. xFOIL uses the relation opposite rarely, and almost never 
uses the relation neighbour. These relations occur in all positive and negative 
examples so they do not discriminate (hence have no gain) between positive and 
negative examples. 

xFOIL-CLICHES learns more accurate rules for FEM concepts than xFOIL. The 
transfer of knowledge from the blocks to the FEM domain allows xFOIL-CLICHES to 
add combinations of literals (or instantiations of a cliche) for more than half of the 
learned rules. Figure 4 shows some rules learned for FEM-3 with xFOIL-CLICHES. 
Of the 15 rules learned for FEM-3, 9 contain a cliche: 8 are instantiations of 
cliche-2: P1(X, Y), P2(Y), P3(Y) (Table 1), and the other is an instantiation of 
cliche-3: P1(X, Y), P2(X), P3(Y). The other two cliches are not used: cliche-1 being 
too general and cliche-4 too specific for that concept. In two rules (2 and 3) 
xFOIL-CLICHES adds at least one literal after a cliche is added to the rule. In rule 2, 
short(Z) is added to describe the new edge introduced by the cliche. As in rule 2, 
not_loaded(Z) is added in rule 3 to describe the new edge introduced by the cliche. 
In this rule, unlike any other rule, xFOIL-CLICHES adds other literals to describe a 
third edge (W) related to the edge introduced by the cliche (i.e. neighbour(Z, W), 
long_for_hole(W), free(W), not_loaded(W)). 



^ A closer look at the learned hypothesis reveals that xFOIL was able to learn the relation 
opposite for these concepts. Hence, it did not suffer from myopia on them as it did on 
the other FEM concepts. 
1  mesh(X, 3) :- usual(X), not_loaded(X), free(X), 
                 [neighbour(X, Z), free(Z), long(Z)]. 

2  mesh(X, 3) :- usual(X), not_loaded(X), 
                 [neighbour(X, Z), fixed(X), one_side_loaded(Z)], 
                 short(Z). 

3  mesh(X, 3) :- usual(X), fixed(X), not_loaded(X), 
                 [neighbour(X, Z), one_side_fixed(Z), short(Z)], 
                 not_loaded(Z), neighbour(Z, W), 
                 long_for_hole(W), free(W), not_loaded(W). 

Figure 4. 3 out of the 15 rules learned for FEM-3 with xFOIL-CLICHES. Cliches (represented 
within square brackets) are used when xFOIL is unable to find a single literal with a gain 
and the rules still cover negative examples. After a cliche is added to a rule, 
xFOIL-CLICHES looks for other literals to add to the rule. 



The transfer of knowledge from the blocks domain allows xFOIL-CLICHES to learn 
FEM concepts with 20 combinations of literals (or instantiations of cliches) (Table 2). 
xFOIL-CLICHES learns more rules to cover all positive examples (20 rules instead of 
13), because learned rules are more specific. In fact, rules are expressed with more 
literals (5 literals instead of 3 with xFOIL) and they cover fewer examples each (65% 
for positive examples instead of 82%, and 11% for the negatives instead of 29% with 
xFOIL, not shown). Learning with a transfer of knowledge improves the accuracy of 
rules (87% compared to 74% with xFOIL), and only introduces rules that increase the 
overall accuracy. 
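
The link between coverage and error can be made explicit with a short calculation. It assumes that a hypothesis misclassifies exactly the positive examples it fails to cover and the negative examples it covers, and it uses the average testing-set sizes (18 positives, 126 negatives). Because the reported figures are averaged over concepts and folds, the values obtained this way only approximate them.

```python
def error_rate(pos_covered, neg_covered, n_pos, n_neg):
    """Fraction of misclassified test examples.

    pos_covered, neg_covered: fractions of positives/negatives covered.
    n_pos, n_neg: number of positive/negative test examples.
    """
    missed_positives = (1.0 - pos_covered) * n_pos
    covered_negatives = neg_covered * n_neg
    return (missed_positives + covered_negatives) / (n_pos + n_neg)


# Average coverage figures quoted above, with the average test-set sizes.
err_xfoil = error_rate(0.82, 0.29, 18, 126)
err_cliches = error_rate(0.65, 0.11, 18, 126)
```

With negatives outnumbering positives seven to one, the drop in covered negatives outweighs the loss of covered positives, which is why the more specific rules end up more accurate.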



Table 2. The number of rules learned, the number of cliches (# cliches) used and the CPU (in 
minutes) required for the learning. For testing, the positive and negative examples covered (PC 
and NC) as well as the resulting error (Err) for the learned hypothesis are shown. All numbers 
are averaged over the partitions of the cross-validation method. The rightmost column shows 
the difference in error between xFOIL and xFOIL-CLICHES ± the standard deviation. Values 
marked with an * represent a significant difference in favour of xFOIL-CLICHES. 





                Learning                    Testing 
FEM     # rules  # cliches  CPU     PC%   NC%   Err%   Improvement 
1          31       23       13      76    12    16    *0.17 ± 0.12 
2          38       23      146      67    23    25     0.15 ± 0.18 
3          16       22       71      68    14    16    *0.12 ± 0.07 
4          12       21       45      81     9    10    *0.10 ± 0.02 
5           9       16       36      42     5     7     0.09 ± 0.17 
6          11       13       19      57     4     7     0.11 ± 0.15 
Avg.       20       20       72      65    11    13 





Table 2 also shows that xFOIL-CLICHES performs significantly better than xFOIL 
on testing examples for half of the FEM concepts. Using cliches improves xFOIL’s 
accuracy, since it decreases the number of misclassified examples for each concept. 
This signifies an improvement from xFOIL to xFOIL-CLICHES and means that the 
data provide enough evidence to say that xFOIL-CLICHES performs better than xFOIL 
on testing sets for all FEM concepts and that it performs significantly better for FEM-1, 
FEM-3, and FEM-4 (values marked with an * in the last column). 

The improvement provided by cliches to xFOIL has a cost. For each clause, 
xFOIL-CLICHES starts by searching for single literals. When none of them can 
improve the gain of the clause, xFOIL-CLICHES searches for cliches and then again for 
single literals if the clause still covers negative examples. Compared to the number of 
search steps used in xFOIL, xFOIL-CLICHES requires 13% more steps to find single 
literals (from 178 to 200 steps). Unlike for xFOIL, the number of steps to search for 
cliches (or variabilized instantiations of cliches) must also be taken into account 
(2210 steps). As a consequence, the CPU time increases from 4 minutes to 72 when 
cliches are used. Because of this, the search space needs to be restricted in some way. 
We ran another experiment in the FEM domain where the search space was restricted 
to the ten most frequently used literals. In this setting, xFOIL-CLICHES improves 
over xFOIL on two thirds of the FEM concepts. 

For our experiments in the blocks domain, concepts for which xFOIL was unable 
to learn a single literal were generated. Experiments in that domain showed that with 
concepts for which xFOIL learns no hypotheses, the cliches always provide an 
appropriate transfer of knowledge. 



6 Conclusion 

Empirical evaluation revealed that cliches learned with CLUSE provide appropriate 
transfer of knowledge across domains to address the myopia problem of a learner that 
uses a greedy search algorithm. xFOIL-CLICHES was used to compare hypotheses 
induced with cliches to hypotheses induced without cliches. Relational cliches 
expressed with variable predicates provide a pattern of variables (showing which 
variables are shared by literals). Experiments showed that cliches often significantly 
improve the accuracy of the hypotheses and, in the worst case, the accuracy is never 
worse than if no cliches were provided. In the blocks domain, experiments showed 
that with concepts for which xFOIL learns no hypotheses, cliches always provide an 
appropriate knowledge transfer. In the FEM domain, cliches learned from a blocks 
concept were useful to learn 50% of the FEM concepts. A first experiment used an 
arbitrary ordering of literals (i.e. the same ordering of literals as was given to the 
system). A second experiment showed that when the search space for cliches was 
restricted to the most frequently used literals in that domain, cliches were useful for 
66% of the FEM concepts. 

The results indicate that ILP learners could benefit from transfer of knowledge 
across domains. Other problems that could benefit from transfer of knowledge across 
domains are drug design [17], text categorization [3] and detecting traffic problems 
[4]. Cliches would be learned from a domain at hand to be fetched from the library to 
learn concepts in a “similar” domain. How to define and measure “similarity” of 
domains is an open problem. In general, the transfer of knowledge could be used 
whenever an ILP learner fails to give practical results. 
Abstract. In this paper, I will argue that agents with simple affective inner 
states (that can be interpreted as “hunger” and “mood”) can have an advantage 
over agents without such states if these states are used to modulate the agents’ 
behavior in specific ways. The claim will be confirmed using results from 
experiments done in a simulation of a multi-agent environment, in which agents 
have to compete for resources in order to survive. 



1. Introduction 

Emotions, despite their importance for natural cognitive systems, have been ignored 
in artificial intelligence (AI) research for a long time, possibly because classical AI 
was focused primarily on different ways of processing symbols to achieve cognitive 
tasks.* Up to the early 1980s, the emphasis was on representations and how they can 
be manipulated.^ Only after a shift in attention from “disembodied” and 
“disconnected” to “embodied” and “situated” AI, that is, from focussing on mere 
abstract processing to focussing on the actual performing of actions, has AI started to 
recognize what animal behaviorists and ethologists (e.g. see [11]) kept stressing again 
and again: that emotions play a crucial role in the processing of information and the 
regulation of behavior.^ 

There is considerable evidence that many animals seem to have “moods” or at least 
certain “modes” related to important survival functions, and these moods or modes 
can be conditioned to environmental stimuli.^ Actually, much evidence indicates that 


* See [14] for example, for a discussion of the “sense-think-act” cycle in the field of 
autonomous agents. 

^ Only a few researchers have stressed the importance of emotions, even to the point that they 
were willing to attribute emotions to simple behaviors (e.g., see [4]). 

^ I still think that emotions have not yet received the attention in AI that they deserve. E.g., 
Arkin’s great overview of behavior-based robotics dedicates only two pages to emotions (see 
ch. 10.1 of [1]). 

^ E.g., see [17] for feeding behavior, [8] for defense behavior, [7] for sexual behavior, or [9] for 
a review of some of the issues related to moods and learning. All of this research is in line 
with the views of the classical ethologists [10] and [18], in particular, the views that the 

H. Hamilton and Q. Yang (Eds.): Canadian AI 2000, LNAI 1822, pp. 389-399, 2000. 
the power of these moods is often great enough to produce extremely maladaptive 
behaviors when the animal is taken away from its natural habitat (e.g., in a laboratory 
setting, see [9] or [3]). 

Although there are some recent investigations studying agents with emotions (e.g., 
the KISMET project at MIT, or [12]), not much is known about how emotional states 
can be employed to make agents more adaptive. Note that “adaptive” does not 
automatically imply “learning”, as an agent might well have a fixed, unalterable 
architecture, yet might be able to alter its behavioral responses to the same 
environmental stimulus according to the status of some inner states (where these 
states are also sometimes called “proprioceptive” states, e.g., see [15]). These inner 
states provide additional input to modules that act upon stimuli, thus allowing for 
more complex decisions to be made without the need for more sense organs. 
Furthermore, some of these states are quite simple to maintain, yet can serve a wide 
range of functions in the control of an agent (e.g., a state corresponding to the 
“hunger”-level of an agent can have tremendous influence on its behavior; at the same 
time its activation is simply inversely proportional to the charge level of a battery in 
an autonomous robot, say). 

It is the purpose of this paper to show that simple emotional states can facilitate 
adaptive behavior (which otherwise might be difficult to accomplish, probably only at 
the expense of adding additional behavioral modules). I will refer to these states as 
“affective states”, as I do not want to get involved in a discussion as to whether robots 
can experience emotions.^ The states I will describe do, however, come close to what 
is sometimes called primary emotions (such as startled, terrified, sexually stimulated, 
etc.; see, for example, [16]). 

In the following I will first spell out my claim in more detail and motivate the 
experimental setup, i.e., the computer simulation used to argue it. Then I shall 
describe the design of the simulation, in particular that of the agents involved, in 
detail, concluding with a summary of the results and their implications for AI. 



2. The Power of Simple Affective States 

There are various ways of elucidating how inner states can influence the behavior of 
agents. In fact, it does not take much to see that agents with inner states are in some 
sense “more powerful” than agents without inner states: the former can keep track of 
states of affairs in the environment, whereas the latter cannot. Unfortunately, inner 
states have mostly been used to keep track of states of affairs that are external to the 


overall behavior of an agent can be understood as the interplay of many less complex 
behaviors that are arranged hierarchically. 

^ I am reluctant to call these states “emotional states”, not only because they do not correspond 
to any of the widely accepted emotional states such as love, anger, faith, distrust, etc., but 
also because I think that one has to distinguish between “having these states” and 
“experiencing them”. Usually, emotional states are associated with experiencing them as 
such, yet, in order to be experienced, they have to be the experience of somebody, and I do 
not believe that there is somebody present in these primitive agents for whom they could be 
experiences (interestingly, there are people who are willing to ascribe genuine emotions to 
even very “unemotional” behaviors, see [13], for example). 
agent (and most of the examples in standard AI textbooks use such external factors). 
Much less attention has been devoted to inner states keeping track of internal states of 
affairs, that is, of information that is directly available inside the agent and often 
most relevant to its proper functioning (or call it “survival”, if you will). For 
example, consider a simple reactive agent, which consists solely of a sensor and a 
motor system that are connected in a certain way so as to exhibit a certain behavior (a 
reflex, for example). The sensor-motor system implicitly divides the agent’s world 
into two categories: stimuli that trigger the reflex, and stimuli that do not trigger it. 
Suppose the reflex is meant to move the agent in a direction different from the 
current one (e.g., away from an obstacle it might bump into). Because the purpose of 
the reflex is to prolong the life of the agent (assuming that, if the agent is a natural 
kind, the reflex is the product of evolution, or that, if the agent is an artifact, the 
reflex has been purposefully designed to prevent damage to the agent), it is “good” for 
the agent if the behavior is not evoked by an external stimulus. In other words, not 
having to react is good, having to react is bad. Having to react many times within a 
short period of time is very bad (as it implies that the environment is cluttered with 
obstacles or that the agent is making stupid moves). So, an internal state that 
integrates the stimulation of the sensor over time might serve as a measure of how 
much the agent “likes” the current environment. Or, putting it less intentionally, such 
a state, connected to the motor system with inhibitory connections, could make the 
agent more reluctant to leave a “safe” environment. Note that this kind of state turns 
an implicit goodness measure into an explicit one. It allows the agent to explicitly 
monitor what otherwise is only implicitly given. Furthermore, such a state might be 
very easy to realize. In the above example, a simple neuron performing temporal 
integration, with its axon connected via an inhibitory synapse to the motor system, 
will suffice. 
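
To make the idea concrete, such a temporally integrating state can be sketched as a leaky integrator. The class name, the decay constant, the synaptic weight and the way the state is subtracted from the motor drive are all assumptions chosen for illustration; they are not taken from the simulation described later.

```python
class IntegratorState:
    """Leaky temporal integrator of reflex activations (a one-neuron sketch)."""

    def __init__(self, decay=0.5, inhibition=0.5):
        self.decay = decay            # assumed fraction of the level kept per step
        self.inhibition = inhibition  # assumed weight of the inhibitory synapse
        self.level = 0.0

    def update(self, reflex_triggered):
        """Decay the old level and add 1.0 whenever the reflex fired."""
        stimulus = 1.0 if reflex_triggered else 0.0
        self.level = self.decay * self.level + stimulus
        return self.level

    def motor_drive(self, base_drive=1.0):
        """Motor activation after subtracting the inhibitory input (never negative)."""
        return max(0.0, base_drive - self.inhibition * self.level)


state = IntegratorState()
history = [state.update(fired) for fired in (True, True, False)]
drive = state.motor_drive()
```

The level rises while the reflex keeps firing and decays once the stimulation stops, so a single number summarizes the agent's recent history of forced reactions and makes the implicit goodness measure explicit.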

In the following, I will describe an experimental setup that is intended to 
demonstrate and confirm the above reasoning. 



3. The Simulation 

To make the case that agents with affective states outperform agents without such 
states, a setup is required in which agents of both kinds can interact. Ideally, I would 
have liked to use real robots to test this claim, as I believe that “the world is its own 
best model” [5]. Unfortunately, this was not possible under resource constraints. So, 
a simulation of an artificial multi-agent environment had to serve as the touchstone 
instead. 



3.1 The Environment 

The simulated environment (the “world”) consists of a 40 x 40 grid. Each cell can 
host at most one of four kinds of objects at any given time: agents (with or without 
affective states), energy sources, moving obstacles, and static obstacles. While agents 
and energy sources can occupy only one square at any time, obstacles may occupy an 
arbitrary number of squares. Moving obstacles move at a constant speed in a 
predetermined direction (and “wrap around” the confines of the environment). Static 
obstacles and energy sources are stationary. Energy sources store a fixed amount of 
energy. They are created spontaneously at random locations within the world and 
stay there for a pre-determined period of time, unless consumed by agents. A 
parameter of the simulation determines the frequency at which energy sources appear. 



3.2 The Agents 

There are, as indicated before, two kinds of agents: those with affective states, called 
“affective agents”, and those without such states, called “reactive agents”. Each 
agent, affective and reactive, is equipped with two sensors of limited reach: a “sonar” 
sensor to detect obstacles (including other agents and the boundaries of the 
environment, but not energy sources), and a “smell sensor” to detect energy sources 
(i.e., “food”). Sensors can be viewed as generating a vector-field for each location in 
the environment, where the orientation of the vector at each position corresponds to 
the direction the agent should be heading in to either come closer to or avoid a 
particular region (see [2]). In the case of energy sources, the vector indicates 
attraction, whereas in the case of obstacles the vector indicates repulsion, the amounts 
of which are indicated by the magnitude of the vector. 

Agents can use a combination of these vector fields to obtain the direction in 
which they should be heading to find energy. This is where the difference between 
affective and reactive agents comes to bear: while reactive agents use a fixed 
weighted sum of both vector-fields (i.e., the gain in the respective schemas is given, 
see [1]), the affective states in affective agents can influence the combination of the 
vector-fields, thus leading to different combinations at different times depending on 
the status of the affective states. Two simple affective states, which I will call 
“hunger” and “mood”, are implemented in affective agents. These states receive 
input from an additional sensor, the “energy level sensor”. Note that this sensor, as 
opposed to the other two, is an internal sensor monitoring the state of the internal 
energy store (the batteries, say). 
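
The difference between the two kinds of agents can be sketched as follows. The function names, the unit gains and the particular way “hunger” scales the food attraction are assumptions for illustration only; the actual weights used in the simulation are not given here.

```python
def combine(smell, sonar, food_gain=1.0, obstacle_gain=1.0):
    """Weighted sum of the attraction (smell) and repulsion (sonar) vectors."""
    return (food_gain * smell[0] + obstacle_gain * sonar[0],
            food_gain * smell[1] + obstacle_gain * sonar[1])


def reactive_heading(smell, sonar):
    """Reactive agents: the gains are fixed once and for all."""
    return combine(smell, sonar, food_gain=1.0, obstacle_gain=1.0)


def affective_heading(smell, sonar, hunger):
    """Affective agents: an inner state modulates the gain.

    Assumption: hunger lies in [0, 1] and linearly raises food attraction.
    """
    return combine(smell, sonar, food_gain=1.0 + hunger, obstacle_gain=1.0)


smell_vec = (1.0, 0.0)    # points towards an energy source
sonar_vec = (-0.5, 0.5)   # points away from an obstacle
fixed = reactive_heading(smell_vec, sonar_vec)
hungry = affective_heading(smell_vec, sonar_vec, hunger=1.0)
```

For the same sensor readings, the reactive agent always obtains the same heading, whereas the hungry affective agent is pulled more strongly towards the energy source.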

Agents can only move straight one square at a time in the direction they are 
heading (i.e., in one of the eight possible directions as determined by the eight 
surrounding squares). In order to move to a surrounding square that is not straight 
ahead, an agent must first turn in the new direction, then it can perform the move into 
the square. By design, agents can never occupy a square of the border of their world. 

Each agent loses energy as time passes and thus has to find energy sources in 
order to survive. While turning their head (i.e., changing the direction of movement) 
does not cost any more energy than staying still, moving results in an additional loss 
of three times that amount. Agents can obtain new energy by moving over energy 
sources, which will store the amount of energy provided by the energy source in the 
agent and cause the depleted energy source to be removed from the world. When 
agents come too close to (static or moving) obstacles, other agents, or the border of 
their world (as determined by a preset parameter) in the direction they are heading, a 
“reflex”-like behavior will attempt to turn the agent in a direction in which there 
are no obstacles. If such a heading can be found, reactive agents will make a 
“reflex”-motion either always at normal speed or always twice as fast, whereas 
affective agents can choose between both speeds depending on their inner states. 
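
The energy bookkeeping described above amounts to the following sketch. The base cost of one unit per update and the action names are assumptions; only the ratio (moving costs the base amount plus three times that amount) is taken from the description.

```python
BASE_COST = 1.0  # assumed energy lost per update when staying still


def step_cost(action):
    """Energy lost in one update for a given action."""
    if action in ("stay", "turn"):
        # Turning the head costs no more than staying still.
        return BASE_COST
    if action == "move":
        # Moving adds three times the base amount on top of it.
        return BASE_COST + 3 * BASE_COST
    raise ValueError("unknown action: %r" % (action,))


def remaining_energy(energy, actions):
    """Energy left after a sequence of actions (no energy source consumed)."""
    for action in actions:
        energy -= step_cost(action)
    return energy


left = remaining_energy(10.0, ["turn", "move", "stay"])
```

Under these assumptions a move is four times as expensive as a turn, so an agent that wanders without finding energy sources starves much faster than one that waits.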
3.3 The Task 

Agents can be destroyed in three different ways, in which case they are removed 
from the world: (1) they run out of energy, (2) they bump into other agents or obstacles, 
or (3) other agents or moving obstacles bump into them. The task of each agent is 
to avoid any of these circumstances and “survive” as long as possible. Simulations 
start with a certain number of agents of each kind and end after a predetermined 
number of updates, after which the number of remaining agents of each kind is 
counted. 




Fig. 1. The reactive layer shared by both kinds of agents and the additional affective layer of 
affective agents. Inputs to the reactive layer come from the external sonar and smell sensors, 
motor output goes to the moving and head-turning system. Inputs to the affective layer are 
provided by the energy level sensor, the reactive layer as well as the previous state of the 
affective layer, while outputs affect only the two layers (they are not connected to any 
effectors of the agent). 



4. The Agents’ Design and Its Justification 



Both kinds of agents share the so-called “reactive layer”, their basic control structure. 
In addition, affective agents possess what I call the “affective layer”, which exerts 
influence on the reactive layer depending on affective states (see figure 1). 



4.1 The Reactive Layer 

The reactive layer is realized as a set of finite state machines, which run in parallel 
and can influence each other (closely related to the style of Brooks’ subsumption 
architecture, [6]). It consists of the following five finite state machines (see figure 2): 

• SONAR 

• SMELL 

• COLLIDE 

• FORCE 

• FORWARD 
Fig. 2. The internal structure of the reactive layer. Arrows indicate information flow between 
systems. “Head” indicates output to the agent’s head (which is used to change the current 
heading), “body” indicates output to the effectors that move the agent in the direction of its 
heading. 

Very briefly, the SONAR and SMELL systems monitor the sonar and smell sensors, 
respectively, and store the direction and magnitude of their current vectors. FORCE 
combines the information from SONAR and SMELL, computes the weighted sum of 
both vectors according to a predetermined weighting scheme, and provides the actual 
direction in which the agent should be heading. FORWARD uses this information to move 
the agent in that direction (possibly after first reorienting the agent’s heading). 
COLLIDE is the most complex subsystem: it continuously monitors the agent’s 
heading and checks (using input from SONAR) whether something is directly ahead of the 
agent (within some predetermined range). If so, it takes control of the motor 
outputs by inhibiting FORWARD and stops the agent. It then initiates a random 
sequence of reorientations. The agent remains stationary until a heading has 
been found in which there are no immediate obstacles (again using SONAR). Then a 
“reflex” moving the agent one step in this very direction is executed, after which 
control over the outputs is released and passed back to FORWARD. The rationale for 
the reflex is as follows. Despite the agent’s attempt to avoid obstacles (by 
integrating sonar readings in the FORCE module), its attraction to energy sources 
might still put it on a collision course with an obstacle if the attraction to food 
is large enough. Also, a moving obstacle might cross the agent’s trajectory, causing 
an equilibrium between the attraction to the energy source and the repulsion caused 
by the obstacle and thus leaving the agent “undecided”, i.e., stationary. Consequently, if 
the obstacle happens to be adjacent to the agent, the agent will get run over by the 
obstacle. Finally, if there is little energy available in the environment at any given 
time, agents tend to move towards the boundaries of their environment (by which they 
are least repelled), and the reflex prevents them from wandering beyond the confines of 
their world. Note that the reflex does not prevent agents from getting run over by 
obstacles or being “bumped into” by other agents (since there are again cases where 
an agent, being cornered by other agents, for example, has reached an equilibrium 
point, is thus unable to move, and gets run over by a moving obstacle). 
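
The weighted vector sum computed by FORCE, and the equilibrium case that motivates the reflex, can be illustrated with a minimal sketch. The function name, the (x, y) vector encoding, and the default weights are assumptions made for illustration only; the text does not specify the actual weighting scheme.

```python
import math

def combine_vectors(sonar, smell, sonar_weight=1.0, food_attraction=1.0):
    """Weighted sum of a sonar vector and a smell vector.

    Both vectors are (x, y) pairs; the sonar vector is assumed to point
    away from obstacles and the smell vector towards energy sources.
    Returns (heading_in_radians, magnitude) of the combined vector.
    """
    fx = sonar_weight * sonar[0] + food_attraction * smell[0]
    fy = sonar_weight * sonar[1] + food_attraction * smell[1]
    return math.atan2(fy, fx), math.hypot(fx, fy)

# Repulsion and attraction of equal size in opposite directions cancel:
# the combined magnitude is zero and the agent has no preferred heading.
heading, magnitude = combine_vectors((-1.0, 0.0), (1.0, 0.0))
```

A zero magnitude corresponds to the “undecided”, stationary agent described above, which is the situation the COLLIDE reflex is meant to resolve.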



4.2 The Affective Layer 

The affective layer, as opposed to the reactive layer, is realized as a very simple 
recurrent neural network, which runs in parallel with the modules of the reactive layer 
(although nothing essential hinges upon the fact that it is a neural network).* The 
output of each unit is the weighted sum of its input units. The network proper 
consists of only two genuine (hidden) units, the two affective states; the other 
units serve merely as inputs (“stopped” and “energy level”) and outputs (“reflex 
strength” and “food attraction”), respectively (see figure 3). 







Fig. 3. The neural network implementation of the affective layer. Lines with arrows indicate 
excitatory connections, lines with bullets indicate inhibitory connections. 



The input “stopped” is derived from the link between the COLLIDE and the 
FORWARD module in the reactive layer: if COLLIDE inhibits FORWARD, then 
this inhibitory input becomes active (expressing the agent’s preference to move out of 
threatening situations). The other input reflects the current energy level of the agent: 
the actual value is fed into the “mood” unit (reflecting the fact that the agent is 
in a better mood when its energy level is higher), while the reciprocal value 
“1/(energy level)” is fed into the “hunger” unit (reflecting the fact that the “hunger” 
level increases as the energy level decreases). 

Both the “hunger” and the “mood” unit have self-excitatory connections, 
reflecting the idea that “being hungry” makes one even more hungry and “being in a 
good mood” tends to be self-sustaining (regardless of minor irritations). By the same 
token, “being in a bad mood” alone tends to worsen the mood further without any other 
cause. The rationale for the mutual inhibition of both units is that hunger contributes to 
one’s bad mood, whereas being happy tends to let one forget about hunger. 

Both hidden units have excitatory connections to the “reflex strength” parameter in 
the COLLIDE module, which determines whether the agent should move one or two fields 
at a time when performing a reflex action (recall from section 3.2 that this value in 
reactive agents is either set to 1 or 2 or is chosen at random each time a reflex is 
performed). The possibility of changing this value, i.e., of controlling the strength of 
the reflex depending on the agent’s mood (as opposed to having it fixed), allows the 
agent to react more forcefully if it is in a “bad mood”. Since being in a bad mood is 
directly related to hunger, an agent caught between different repulsive forces 
will turn further away from obstacles when it is hungrier, thus increasing the 
likelihood of “picking up a scent” (that is, a trajectory that will lead it to an energy 
source) instead of being forced to remain “blocked”. At the same time, the agent 
will not waste energy on reflexes unnecessarily if it is happy (which means, as a 
consequence, that there is no immediate need for food), thus conserving energy. Note 
that an ongoing “block” situation alone can rapidly worsen an agent’s mood (without 
the agent necessarily being hungry), and thus can also lead to more forceful 
reflexes. 

* The weights of the connections were obtained experimentally. It seems to me that it should 
be possible to “learn” them using some version of reinforcement learning. 

The other output is connected to the parameter “food attraction” in the FORCE 
module. This parameter is the weight associated with the vector from the smell 
sensor, which is used by the FORCE module to compute the weighted sum of the 
sonar and smell vector fields. Changing this value while keeping the positions of 
energy sources and obstacles constant can change the trajectories of the combined 
vector field significantly: what was attractive before might not be attractive any 
longer (and vice versa). It is, in my view, largely this possibility of controlling 
their trajectories (and also to some degree the above-mentioned control of the reflex 
strength) that accounts for the advantage of affective agents over reactive agents in 
the experiments described in the next section. 
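
The structure of the affective layer can be sketched as a single update step of a two-unit linear recurrent network. This is only an illustration under stated assumptions: the connection types (self-excitation, mutual inhibition, energy into “mood”, reciprocal energy into “hunger”, “stopped” inhibiting “mood”) follow the description above, but every weight magnitude is hypothetical, and the choice to drive “food attraction” from the hunger unit alone is an assumption, since the text does not say which hidden unit feeds that output. The “reflex strength” output is left out for the same reason.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AffectiveWeights:
    # Hypothetical magnitudes; the paper's weights were tuned experimentally.
    self_hunger: float = 0.5            # hunger excites itself
    self_mood: float = 0.5              # mood excites itself
    mood_to_hunger: float = -0.3        # mutual inhibition
    hunger_to_mood: float = -0.3        # mutual inhibition
    inv_energy_to_hunger: float = 100.0 # weight on 1/(energy level)
    energy_to_mood: float = 0.002       # weight on the energy level
    stopped_to_mood: float = -0.5       # inhibitory "stopped" input
    hunger_to_food_attraction: float = 1.0  # assumed output connection

def step(hunger, mood, energy, stopped, w=AffectiveWeights()):
    """One update: each unit outputs the weighted sum of its inputs."""
    stopped_value = 1.0 if stopped else 0.0
    new_hunger = (w.self_hunger * hunger
                  + w.mood_to_hunger * mood
                  + w.inv_energy_to_hunger / energy)
    new_mood = (w.self_mood * mood
                + w.hunger_to_mood * hunger
                + w.energy_to_mood * energy
                + w.stopped_to_mood * stopped_value)
    food_attraction = w.hunger_to_food_attraction * new_hunger
    return new_hunger, new_mood, food_attraction

# A well-fed agent versus a nearly starved one, both starting from rest.
fed = step(0.0, 0.0, energy=500.0, stopped=False)
starved = step(0.0, 0.0, energy=50.0, stopped=False)
```

With these illustrative weights the starved agent ends up with higher hunger, lower mood, and a larger food-attraction weight than the well-fed one, which is the qualitative behavior the section describes.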



5. Experiments and Results 

Before starting the actual experiments, various parameters of the environment had to 
be determined experimentally, such as the number and sizes of stationary obstacles; 
the number, sizes, and speeds of moving obstacles; and the number of energy sources 
together with their energy capacities, frequency of appearance, and lifetime. 
Once the various degrees of freedom of the environment had been fixed, it was 
possible to determine an appropriate number of agents to inhabit the environment as 
well as an appropriate time span for which to run the simulation (in order to see any 
effects of the agents’ interactions at all). The final figures for the various parameters 
used in all the subsequent simulations are:** 

• two stationary obstacles (of dimensions 3x5 and 5x6, respectively) 

• two moving obstacles (of dimensions 3x3 and 4x3, respectively), the first 
moving vertically, the second diagonally at a constant speed of 1/10-th the speed of 
agents 

• one energy source in the beginning 

• the capacity of energy sources fixed at 250 with life time approximately 400 time 
steps 

• new energy sources appear on the average every 30 time steps at random locations 

• agents start out with an energy level of 500 

• the duration of the simulation fixed at 2000 time steps 

It is worth pointing out that only reactive agents were used in the above simulations 
and that in only 10 percent of all simulations did some reactive agents survive 
until the end of a simulation. 



** Note that there are many possible appropriate parameter settings and that nothing essential 
hinges on the above choice of parameter values. 

Using the above configuration, three experiments were run. In experiment A, four 
reactive agents and one affective agent were placed at random safe locations in the 
environment (that is, at least one field away from any obstacle). In experiment B, 
one reactive and four affective agents were used instead. Finally, in experiment 
C, five affective agents had to compete with each other, with the main difference that 
they did not use predefined weights in the affective layer, but rather random 
connections, which they could alter using Hebbian learning. In this setup, agents that 
survived for 1000 time steps would reproduce at that point and generate one identical 
offspring. 
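
For readers unfamiliar with Hebbian learning, the following minimal sketch shows the plain Hebbian rule, in which a connection is strengthened in proportion to the product of the activities at its two ends. The text does not state which variant of the rule was used in experiment C, so the rule shown here, the function name, and the learning rate are assumptions for illustration.

```python
def hebbian_update(weights, pre, post, rate=0.01):
    """Return new weights after one plain Hebbian step.

    weights[i][j] connects presynaptic unit j to postsynaptic unit i;
    each weight changes by rate * post[i] * pre[j].
    """
    return [
        [w + rate * post[i] * pre[j] for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]

# One postsynaptic unit with two inputs, starting from zero weights.
updated = hebbian_update([[0.0, 0.0]], pre=[1.0, 2.0], post=[0.5], rate=0.1)
```

Because the plain rule only ever reinforces co-active units, practical variants usually add weight decay or normalization; whether the original simulations did so is not stated.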

The purpose of experiment A was to test whether an affective agent could survive 
in an environment dominated by reactive agents and, furthermore, whether the affective 
agent would do better than the average reactive agent. The purpose of experiment B 
was to see whether a small number of reactive agents could survive in an environment 
dominated by affective agents. The purpose of experiment C was to check whether 
agents with the same computational resources would do equally well or equally badly 
and, furthermore, whether associative learning would lead to a configuration in the 
affective layer similar to the “hard-wired” setup of the affective agents. 

Simulations typically start out with some agent consuming the only energy source 
present, followed by a migration of most agents to the borders of the environment (for 
lack of energy sources). Affective agents, being “content and full” in the 
beginning, do not go after food at all; rather, they avoid energy sources, thereby also 
avoiding contact with other agents and thus reducing the risk of bumping into 
them. After a while affective agents will get moody and hungry, thus becoming more 
attracted to food, which eventually leads to very aggressive “scent following” 
behavior (if an affective agent does not find food in time or has been stuck for quite 
some time in a particular location). Sometimes this behavior will worsen an agent’s 
situation (and eventually lead to its destruction). By and large, however, affective 
agents will only compete for food if they are “in a bad mood and hungry”; otherwise 
they will not participate in the competition for energy resources. Reactive agents, on 
the other hand, regularly maneuver themselves into predicaments because of their 
constant interest in food, which often forces them into situations where contact with 
other agents and/or obstacles is inevitable, eventually resulting in the destruction of 
the involved agents. 

All experiments consisted of 20 simulations. In experiment A, the affective agent 
survived in about half of the simulations, and so did the reactive agent in experiment 
B. However, only one reactive agent survived on average in experiment A, 
whereas more than two affective agents survived on average in experiment B. 
This is what we should expect: by reducing the risk of being forced into no-win 
situations, a larger population of affective agents than of mere reactive agents can 
survive in a given environment. The chances of survival are better for affective 
agents because they are able to control their own trajectories through the 
environment, whereas reactive agents cannot help chasing after food (even if they 
have enough energy). Finally, experiment C confirmed that using the two hidden 
nodes as “affective states” (or at least as states that serve the role of affective states or 
can be interpreted as such) is advantageous: some agents, whose initial random 
weights in their affective layer reflected a setup similar to the hard-wired affective 
layer (with variations only in magnitudes and all the signs), would eventually “learn” 
through experience to avoid food if the energy level was high and seek food 
aggressively if energy was low. Only such agents eventually survived in experiment 
C, which implies that the additional computational means of the affective layer are of 
use only if they alter the perception of food depending on the inner state. 
Interestingly, while one of the two hidden nodes always reflected something like the 
“hunger level”, it was sometimes difficult to interpret the role and/or meaning of the 
other (in fact, there were agents in which this other state was not used at all, leading to 
the conclusion that in this particular setup one affective state, namely “hunger”, 
would have sufficed). 



6. Discussion 

What the experiments confirm is what seemed intuitively clear from the beginning, 
namely that avoiding confrontations unless really necessary is a better strategy for 
survival than ignoring the risk that competition for resources carries. 
Since it is the affective states that regulate the behavior of affective agents, the above 
experiments confirm the original claim that affective states are an advantage in the 
competition for resources, and eventually for survival. It was also shown that the 
additional computational power is only useful if it is used in a setup that resembles the 
hard-wired affective layer, which suggests that the hidden nodes serve the role of 
affective states. 

Note, however, that it can be experimentally confirmed that a single reactive agent 
shows better performance than a single affective agent (in a slightly altered 
environment), since once the risk of competition is excluded, affective agents’ neglect 
of food (when they are not hungry) can come back to haunt them: they might happen to 
look for food at a time when none is available, while reactive agents take whatever they 
can get at any time (thus being implicitly “more foresighted” by storing energy in 
advance). This is, in my view, further evidence for the biological plausibility of the 
model: affective agents are best suited for competitive environments, where avoiding 
threatening competitions is advantageous in the long run; taken out of their “natural 
habitat” and put in a less favorable environment, their performance will decrease (see 
also [3]). 

Finally, I would like to point to an interesting open problem. The experiments 
suggest that a large population of affective agents can support a small population of 
reactive agents. What is the maximal population size of reactive agents that can be 
supported by a given population of affective agents? My hunch is that a satisfactory 
answer to this problem might even shed some light on the actual evolution 
of affective agents (out of mere reactive agents), that is, on the evolution of affective 
states. 

Abstract. As the number of applications available on the World Wide 
Web (WEB) increases at a rapid speed, an enormous number of resources 
become available to the general public. These resources offer information 
and services in a variety of domains but are often difficult to use due 
to their idiosyncratic domain and interaction models. In this paper, we 
discuss a multi-agent architecture for integration of WEB applications 
within a domain, based on a task-structure approach. The architecture 
consists of a set of wrapper agents, driving and extracting information 
from a set of corresponding WEB applications, and a mediator agent, 
whose task structure drives both its interaction with the users and its 
communication with the wrappers. We illustrate this architecture with 
the description of a travel-planning assistant that supports the users in 
an exploratory, least-commitment search behavior. 



1 Introduction 

As the number of applications available on the World Wide Web (WEB) increases 
at a rapid speed, an enormous number of resources become available to 
the general public. These resources offer information and services in a variety of 
domains, such as the stock market, insurance purchasing, medical prescription 
ordering and travel planning, to name just a few. However, these applications are 
often difficult to use, for several reasons. 
Some WEB applications expose a cumbersome and unintuitive interaction model 
to their potential users because of their underlying implementations. For example, 
some WEB applications run on legacy systems whose interfaces are ASCII 
screens and require, in general, fairly long navigations in order to accomplish any 
particular task. In addition, the majority of WEB applications, whether they are 
re-engineered on top of legacy applications or were developed specifically 
for the WEB, offer a deterministic interaction model: they always require a set 
of specific inputs from their users and force them to go through a single standard 
sequence of steps. Such interaction models are not applicable in cases where the 
user’s problem is not well defined, but is of the type “I don’t know what I want, 
but I will recognize it when I see it”. In such cases, the user has to commit 
early to specific inputs without actually intending to, and if this early selection 
does not lead to a satisfactory solution, the user has to repeat the process with 
alternative input selections. To make matters worse, there is little commonality 
across WEB applications offering services in a domain. Any two applications in 
a given domain may use different terminology and interaction models for similar 
services. 

Mediation [?] is the integration of diverse and heterogeneous information 
sources by abstracting away the representation differences among them, and 
by integrating their individual views of the application domain into a common 
model. A mediator interacts with the users, translates the users’ requests into 
interaction protocols with the wrappers of its underlying information sources, 
collects and processes (i.e., sorts, summarizes, or abstracts) the wrappers’ 
responses, and, finally, presents the overall result to the user. 

To date, there are two veins of artificial intelligence research focusing on the 
two aspects of this overall problem. On the one hand, there is work on automating 
wrapper construction, such that the mediator can communicate in a canonical 
way with all its underlying sources. The wrapper construction process may 
be based either on user examples identifying the interesting elements of the 
presentation space that the source exposes [?], or on an underlying domain 
model [?]. In the same area, some work has focused on the use of the Extensible 
Markup Language (XML) [15] as the representation language for exchanging 
information on and about WEB sources, enabling the storage of data together 
with its semantics [?]. 

Focusing more on the interaction of the mediator with the user, work on 
agents’ interaction languages has emphasized the need for well-defined semantics 
for high-level discourse. Elio and Haddadi [?] suggested that, in addition to 
performatives [?], high-level goal-specific structures are necessary to express 
the semantics of the discourse between a human and an artificial agent who 
cooperate to accomplish a common goal, such as a successive-refinement search 
task. 

In our work on agent-based WEB-application integration, we adopt and integrate 
several of the above lines of work: we have developed an architecture for 
task-specific mediation, in which information sources within an application domain 
are encapsulated within wrapper agents that interact with an intelligent 
intermediary agent, the mediator. The mediator is designed to support users 
performing a particular task in the given application domain, such as 
least-commitment, exploratory search. The wrappers are designed to expose aspects 
of the underlying sources’ functionality useful for the mediator’s task. The mediator 
agent provides (a) a domain model to integrate the terminology of the 
integrated applications, and (b) a task model that drives its interaction with its 
users and its communication with the wrappers of the integrated applications. 
To date, we have developed a prototype of this architecture in the travel-planning 
domain. 

The rest of this paper is organized as follows: section 2 discusses the overall 
mediation architecture, section 3 describes a prototype for a travel-planning 
mediator developed in this architecture and illustrates the mediator’s process 
with a specific example, and finally section 4 outlines some early conclusions 
that can be drawn from this work as well as our plans for future work. 

2 A Task-Structure Based Mediation Architecture 

In our work, we have focused on the problem of how a set of heterogeneous 
applications on the WEB can be integrated to deliver novel services within an 
application domain. The integration architecture discussed in this paper proposes 
a multi-agent approach to this problem. 

For our multi-agent integration architecture, we have adopted task structures 
as the mechanism underlying the agents’ behavior. Task structures [2] analyze 
problem-solving behavior in terms of elementary building blocks. Several formalisms 
have been developed for representing task structures for a variety of 
purposes, ranging from simulating expert behavior in knowledge-based systems 
to modeling software systems. In particular, the SBF-TMK* language was used 
to model the design of a software system so that an intelligent agent could monitor 
the system’s actual run-time behavior, recognize failures, and reconfigure it 
at run-time or possibly redesign its design elements [?]. 

In the SBF-TMK language, a task is characterized by the type(s) of information 
it consumes as input and produces as output, and the nature of the transformation 
it performs between the two. A complex task may be accomplished by 
one (or possibly more) method(s), each of which decomposes it into a non-deterministic 
network of simpler subtasks. A simple, non-decomposable task, i.e., a leaf task, 
corresponds to an elementary, executable operator. In addition to the system’s 
task structure, the SBF-TMK language also specifies the system domain, in 
terms of the types of objects that it contains, their relations and constraints on 
these relations. The information elements that flow through the task structure 
(produced and consumed by its tasks) are instances of the domain object types. 

This view of problem-solving behavior can be naturally transferred to our 
multi-agent integration architecture of mediator and wrappers. The behavior of 
the overall multi-agent system can be modeled in terms of a non-deterministic 
task structure. The wrappers of the individual sources implement elementary leaf 
tasks using the services of their underlying sources. The mediator implements 
a set of high-level complex tasks, with non-deterministic decompositions. The 
user’s interaction with the mediator decides which task is active at each point in 
time and which decomposition is employed. When a particular task is sufficiently 
decomposed into elementary tasks, the mediator requests the relevant wrappers 
to accomplish them and to return their results. In addition to decomposing high-level 
tasks into elementary ones, the mediator task structure also provides the 
road map for composing the individual wrappers’ results in a coherent solution 
to the overall task at hand. 

For these reasons, we have adopted the SBF-TMK language to describe the 
internal processes of the mediator and the wrapper agents. To that end, we have 
developed an XML representation for the SBF-TMK language. XML is a subset 
of SGML that allows user-defined tags in order to best describe the semantics of 
the data content. The set of tags relevant to annotating entities, their attributes 
and their relationships in a particular domain is described in a DTD (Document Type 
Definition) document or an XML Schema. XML is fast becoming the de facto 
standard for communicating on the WEB, and as more sources are specified in 
XML, information extraction from these resources will become easier. Furthermore, 
different views of XML data can be specified in terms of XSL (Extensible 
Stylesheet Language) stylesheets, thus enabling customization of the data content and 
its presentation style for different purposes and different user preferences. 

* Structure-Behavior-Function models of Tasks, Methods and Knowledge 

Fig. 1. The Overall Mediator-Wrappers Architecture. 

Figure 1 shows the architecture diagram of the integration architecture. It 
consists of three layers: the user interface layer (left area of the Figure), the 
mediator layer (middle area of the Figure) and the wrappers’ layer (right area 
of the Figure). There are three different types of tasks in a mediator’s task 
structure: (a) user-interaction tasks, (b) internal tasks, and (c) information-collection 
tasks. User-interaction tasks are denoted with “post-it” rectangular 
shapes in Figure 1 and implement the mediator’s interaction with the user, i.e., 
requesting (or displaying) information from (to) the user. Internal tasks are denoted 
with rounded-edge rectangles in the Figure, and accomplish information 
processing internal to the mediator. Finally, information-collection tasks are denoted 
with ovals in the Figure, and correspond to the leaf tasks implemented by 
the wrappers collaborating with the mediator. 

The interaction between the user and mediator is supported by an XML 
browser, such as Internet Explorer. The mediator presents to the interface layer, 
i.e., the user’s browser, the alternative high-level tasks that it can support as a 
menu of possible selections. Note that, although in Figure 1 the mediator’s task 
structure is shown as a tree, it is in general a forest, i.e., it consists of a set 
of high-level tasks. This menu is automatically generated by an XSL stylesheet 
developed to translate task structures to menu-driven interfaces. To initiate an 
interaction session with the mediator, the user has to specify the high-level task 
of this interaction session. In response, the mediator retrieves the task structure 
corresponding to the selected task and recursively descends from the root task 
to its subtasks. When, during this descent, a user-interaction task is reached, 
the mediator presents to the interface layer a form appropriate for the type of 
information that the user has to provide or receive. The generation of the form 
is based on the mediator’s domain model. The domain model is specified by 
an XML document describing the domain objects and their relationships. For 
each element type in this document, an XSL stylesheet is defined for displaying 
elements of this type and for receiving values of such elements from the user in 
a form-based interface. 

When the mediator reaches an information-collection task, it looks up the 
wrapper registry, denoted as the drum in the top-right corner of the mediator 
component in the Figure, to identify the wrappers that can fulfill the information 
transformation of the task at hand. The wrapper registry is essentially an 
association table mapping wrappers to the set of information-collection tasks 
that they can accomplish. Based on the wrapper registry, the mediator identifies 
which wrappers can produce the output information given the available 
input and sends them a request. For each elementary task that it can perform, 
a wrapper has a corresponding plan for driving its underlying application. For 
WEB applications the plan consists of a sequence of HTTP “post” messages 
followed by parsing of the source’s response. Parsing is also based on the domain 
model. Each wrapper contains a generic parser and a copy of the XML document 
describing the domain, annotated with extraction rules for recognizing elements 
of each type in the source’s responses. This annotated XML document defines 
a grammar for appropriately interpreting the application response in terms of 
the common domain model. Thus, for every request it receives from the mediator, 
a wrapper executes a three-step process, involving (a) mapping the input 
provided by the mediator to the input parameters of the plan, (b) executing the 
plan and (c) extracting the output desired by the mediator from the source’s 
response. The mediator then collects the data and continues its execution of the 
task structure. 
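
The registry lookup and the wrapper’s three-step process can be sketched as follows. All names here (Wrapper, wrappers_for, the task name find_airports, the city and airport codes) are hypothetical, and a plain function stands in for the sequence of HTTP “post” messages, so the sketch shows only the control flow, not the prototype’s implementation.

```python
from typing import Callable, Dict, List, Set

class Wrapper:
    """Hypothetical wrapper: one plan and one extraction rule per task."""

    def __init__(self,
                 plans: Dict[str, Callable[[dict], str]],
                 extractors: Dict[str, Callable[[str], list]]):
        self.plans = plans            # task -> function driving the source
        self.extractors = extractors  # task -> function parsing the response

    def handle(self, task: str, mediator_input: dict) -> list:
        # (a) map the mediator's input to the plan's input parameters
        params = {key.lower(): value for key, value in mediator_input.items()}
        # (b) execute the plan against the underlying source
        raw_response = self.plans[task](params)
        # (c) extract the output the mediator asked for
        return self.extractors[task](raw_response)

def wrappers_for(registry: Dict[str, Set[str]], task: str) -> List[str]:
    """Look up which wrappers can accomplish an information-collection task."""
    return sorted(name for name, tasks in registry.items() if task in tasks)

def fake_airport_source(params: dict) -> str:
    # Stand-in for the HTTP exchange with a real source (made-up data).
    return "AAA;BBB" if params.get("city") == "Sampleville" else ""

wrappers = {
    "airport_table": Wrapper(
        plans={"find_airports": fake_airport_source},
        extractors={"find_airports": lambda raw: [c for c in raw.split(";") if c]},
    )
}
registry = {"airport_table": {"find_airports"}}

results = [
    wrappers[name].handle("find_airports", {"City": "Sampleville"})
    for name in wrappers_for(registry, "find_airports")
]
```

Keeping the registry separate from the wrappers themselves means the mediator can decide whom to ask without contacting any source.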

When an internal task is reached, if it is complex (denoted as bigger rounded 
rectangles in the Figure) it simply sets up a set of simpler tasks. If it is elementary 
(denoted as smaller rounded rectangles in the Figure), an internal mediator 
functionality is invoked to process the currently available information, i.e., to 
derive further information from the user’s input or to further process the information 
collected by the wrappers before presenting it to the user. These tasks 
enable the integration of further intelligent processing of the data available in 
the domain. 
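
The descent through a task structure with its three task types can be sketched as a small recursive traversal. The Task class, the kind labels, and the example tasks and fares are hypothetical. The sketch also simplifies the architecture in one important respect: subtasks are visited in a fixed order, whereas in the architecture described above the decomposition is non-deterministic and steered by the user’s interaction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Task:
    name: str
    kind: str  # "user", "internal", or "collect" (information collection)
    subtasks: List["Task"] = field(default_factory=list)
    action: Optional[Callable[[Dict], None]] = None  # leaf behaviour

def descend(task: Task, state: Dict, trace: List[str]) -> List[str]:
    """Visit a task, then its subtasks in order, running leaf actions."""
    trace.append(f"{task.kind}:{task.name}")
    if task.subtasks:
        # Complex task: it only sets up simpler tasks.
        for subtask in task.subtasks:
            descend(subtask, state, trace)
    elif task.action is not None:
        # Elementary task: ask the user, query wrappers, or process data.
        task.action(state)
    return trace

find_airfare = Task("find_airfare", "internal", subtasks=[
    Task("specify_inputs", "user",
         action=lambda s: s.update(city="Sampleville")),
    Task("query_sources", "collect",
         action=lambda s: s.update(fares=[410, 320])),
    Task("sort_fares", "internal",
         action=lambda s: s.update(fares=sorted(s["fares"]))),
    Task("display_results", "user",
         action=lambda s: s.update(shown=s["fares"][0])),
])

state: Dict = {}
trace = descend(find_airfare, state, [])
```

The shared state dictionary plays the role of the information elements that flow through the task structure, produced by some tasks and consumed by later ones.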

3 The Travel Agent Prototype 

Let us now discuss an instantiation of this architecture, consisting of a travel 
agent mediator and a set of related brokers, to illustrate the ideas discussed 
above. Personal travel assistants (PTA) are one of the seven areas specified by FIPA 
[1]. In this specification, 

...the PTA interacts with the user and with other agents, representing the available 
travel services. The agent system is responsible for the configuration and 
delivery - at the right time, cost, Quality of Service, and appropriate security 
and privacy measures - of trip planning and guidance services. It provides examples 
of agent technologies for both the hard requirements of travel such as airline, 
hotel, and car arrangements as well as the soft added-value services according to 
personal profiles, e.g. interests in sports, theater, or other attractions and events. 



Our travel mediator currently interacts with the following information sources: 

— An XML table, describing airports, their codes and their location in terms 
of country, state and city. This table was produced by wrapping the HTML 
source at http://www.cowtown.net/users/rcr/aaa/ccmain.htm. 

— ITN, an on-line travel reservation system, exposing a form interface to the 
users: http://www.itn.com/. A wrapper was hand-crafted for ITN. 

Figure 2 depicts the travel mediator’s domain model (top of the Figure) and 
task structure (bottom of the Figure). The domain objects include air ticket, 
date, location, flight and fare. A date is specified in terms of a year, month, day 
and time. A location is specified in terms of a country, state, city and airport. 
A flight is specified in terms of an airline and flight code. A fare is specified in 
terms of currency and amount. An air ticket is specified in terms of origin and 
destination, both of which are locations; departure and return dates, both of 
which are dates; departure and return flights, which are flights; and air fare, which 
is a fare. 

An interesting aspect of the model is that the attributes of location and 
time have domain-level relationships. For example, country, state and city are 
all attributes of location, but at different levels of specificity: state is at a lower 
(more specific) level than country, and at a higher (more general) level than city. 
The underlying intuition is that values at one level are collapsed and summarized 
at the next higher level, and a value at one level corresponds to a collection of 
values at the next lower level. 

The concept of domain level is very important for supporting flexible, exploratory 
search behaviors. It enables the interpretation of vague problem statements, 
i.e., statements expressed in terms of attribute values at a high level, as 
a collection of specific ones, i.e., statements expressed in terms of their corresponding values 
at lower levels. So, for example, when the user does not want to commit to a 
specific destination airport but only to a city, the mediator is able to translate 
this vague problem statement into a collection of problems, each of which has as 
its destination airport one of the airports close to the given city. Alternatively, 
when the user has made strong and mutually exclusive commitments, the mediator 
may explore possible relaxations by substituting the specific elements of 
the problem statement with more general ones at a higher level of abstraction. 
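
Refinement and relaxation across domain levels can be illustrated with a toy location hierarchy. The level names follow the location attributes given above, but the table contents, the place names, and the function names refine and relax are invented for this sketch; the prototype’s actual representation is the XML domain model.

```python
from typing import Dict, List, Optional, Tuple

LEVELS = ["country", "state", "city", "airport"]  # general -> specific

# Hypothetical toy hierarchy: (level, value) -> values one level below.
CHILDREN: Dict[Tuple[str, str], List[str]] = {
    ("state", "Examplestate"): ["Sampleville", "Otherton"],
    ("city", "Sampleville"): ["AAA", "BBB"],
    ("city", "Otherton"): ["CCC"],
}

def refine(level: str, value: str) -> List[str]:
    """Expand a value into all matching values at the most specific level."""
    if level == LEVELS[-1]:
        return [value]
    level_below = LEVELS[LEVELS.index(level) + 1]
    result: List[str] = []
    for child in CHILDREN.get((level, value), []):
        result.extend(refine(level_below, child))
    return result

def relax(level: str, value: str) -> Optional[Tuple[str, str]]:
    """Return the (level, value) one step more general, if known."""
    for (parent_level, parent_value), children in CHILDREN.items():
        is_parent_level = LEVELS.index(parent_level) + 1 == LEVELS.index(level)
        if is_parent_level and value in children:
            return parent_level, parent_value
    return None

airports_for_city = refine("city", "Sampleville")
airports_for_state = refine("state", "Examplestate")
more_general = relax("airport", "CCC")
```

A request that names only a city is thus turned into one concrete problem per airport, and an over-constrained request can be loosened by replacing an airport with its city.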

The mediator’s overall task (shown at the bottom of Figure 2) is travel 
planning, which is decomposed into (any of) the following tasks: finding airfare, 
renting a car and making a hotel reservation. All three tasks are instances 
of a single generic task, i.e., exploratory search, as indicated by the dashed-line 
arrows in the Figure. The outputs of these tasks are correspondingly an air 
ticket, a car rental and a hotel reservation. Because these tasks are all instances 
of the exploratory search task, their inputs can be any combination of values for 
the attributes of the desired output. So, for example, the input of the fare-finding 
task can be any combination of values for the air-ticket attributes, that is, origin 
and destination locations, and departure and return dates. The user may specify 
these inputs at any level of specificity, i.e., at any level in the location and date 
domains. 









Fig. 2. A diagrammatic depiction of the mediator’s domain model (top) and 
task structure (bottom). 



The search task structure of the travel-assistant mediator is designed for ill-structured
problems, where a precise specification of the desired output is not
available and instead, the user explores the solution space following a least-commitment
strategy. The overall search task is decomposed into six subtasks.
First, the problem inputs are specified by the user. The problem inputs are at- 
tributes of the solution, so essentially in this task the user specifies the problem 
as a set of constraints on the range of values that different attributes of the 
solution may take. The mediator then generates a set of detailed problem speci- 
fications, by refining all the abstract elements of the input problem specification 
to their most specific domain levels. Then the mediator identifies the relevant 
wrappers that are able to solve the problem, communicates the problem to them 
and collects their solutions. If no solution is found, the mediator requires the user 
to relax some of the original constraints, i.e., to specify some input at a higher 
domain level, and it tries again. If some solutions are found, the mediator can 
further process the solution set by deriving additional attributes for the solu- 
tions, such as total length of travel for example. Finally, the mediator evaluates 
the size of the result set. If it is small, the mediator presents all the solutions 
to the user, and allows the user to decide on one of the returned solutions or 
to further modify (relax or strengthen) the constraints. If the solution set is too 
large, the mediator proposes to the user dimensions of the input specification 
that could be further constrained, and gives appropriate suggestions to the user. 
The process continues until either a satisfying result is found or the user quits. 
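
The control flow of this loop can be sketched as follows. This is a simplified illustration, not the mediator's implementation: the function name and all parameters (`refine`, `wrappers`, `relax`, `narrow`, `derive`, `max_results`) are hypothetical callbacks standing in for the subtasks described above. Every wrapper passed in is assumed to be relevant, and a small result set is simply returned rather than offered for further modification.

```python
def exploratory_search(spec, refine, wrappers, relax, narrow,
                       derive=lambda solution: solution, max_results=20):
    """Repeat the search until a non-empty, manageable result set is found.

    spec     -- the user's constraints on the solution attributes
    refine   -- spec -> list of fully specific problem specifications
    wrappers -- callables mapping one specific problem to a list of solutions
    relax    -- spec -> more general spec, or None if the user quits
    narrow   -- (spec, solutions) -> tighter spec, or None if the user quits
    derive   -- post-processing step that adds derived attributes
    """
    while spec is not None:
        solutions = [derive(solution)
                     for problem in refine(spec)
                     for wrapper in wrappers
                     for solution in wrapper(problem)]
        if not solutions:
            spec = relax(spec)              # no result: generalize some input
        elif len(solutions) > max_results:
            spec = narrow(spec, solutions)  # too many: constrain further
        else:
            return solutions                # small enough to show the user
    return []
```

Keeping the subtasks as parameters mirrors the point made in the text: the loop itself is generic, and only the callbacks depend on the task (air ticket, car rental, hotel) being served.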

When this search task structure is invoked in service of the air ticket finding 
task, it is instantiated in the context of this task’s input-output transformation. 
Thus, first the ticket finding inputs are collected, and subsequently refined, then 
the appropriate wrappers of ticket reservation sources are invoked to identify 
possible tickets that would correspond to the problem in question and then the 
resulting tickets are collected. The tickets are presented to the user in a tabular 
format that can be sorted according to all its attributes, such as price or length 
of travel or length of stopovers for example. By reviewing the ticket collection 
according to different attributes, the user may select a specific ticket or decide 
on how to further refine the problem specification. 

3.1 An Example Scenario 

In this section, we discuss a specific scenario to illustrate how the exploratory 
search behavior is supported in the context of finding air tickets. A user plans 
a vacation “to California in July”. “July” and “California” are too vague for 
querying any of the current travel-agent WEB applications. The specific desti- 
nation and date may eventually depend on the availability of flight seats and 
the price of the airfare, but in order to make more precise decisions the user 
may want to evaluate the possible options. For such problems, the specification 
of the problem may shift around within the domain. The above specification 
implies a range of destinations and a range of departure and arrival dates. The 
mediator queries the city database to find all the cities in California, and then 
the airport database to find the airports in those cities and the airport in Ed- 
monton. In this manner, the origin and destination are defined at the lowest, 
most precise domain level for locations. These queries are executed through an 
XQL engine |S], because the airport database is in XML. It then identifies, by 
executing an internal task, specific departure and arrival dates - two weeks apart
from each other. 

All data related to specific user-mediator sessions are contained within the 
mediator’s session DOMs. The Document Object Model (DOM) is a program- 
ming API for HTML and XML documents. It defines the logical structure of 
documents and the way a document is accessed and manipulated. In the Doc- 
ument Object Model, documents have a logical structure which is very much 
like a tree. The mediator’s session DOM is used to store the data for every task 
execution. The session DOM has one child node for each user; each user node 
has one child node for each task; and finally each task node has two child nodes, 
one for input data and the other for output data. 
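
The shape of such a session tree can be illustrated with Python's standard `xml.etree.ElementTree` module. This is a sketch only: the element and attribute names (`session`, `user`, `task`, `input`, `output`, `id`, `name`) and the helper `add_task` are assumptions, since the mediator's actual schema is not given here.

```python
import xml.etree.ElementTree as ET


def add_task(session, user_id, task_name, inputs):
    """Attach a task node, with input and (empty) output children, to a user node."""
    user = session.find(f"user[@id='{user_id}']")
    if user is None:  # first task for this user: create the user node
        user = ET.SubElement(session, "user", id=user_id)
    task = ET.SubElement(user, "task", name=task_name)
    input_node = ET.SubElement(task, "input")
    for attribute, value in inputs.items():
        ET.SubElement(input_node, attribute).text = value
    ET.SubElement(task, "output")  # left empty for the wrappers to fill in
    return task


session = ET.Element("session")
task = add_task(session, "u1", "find-airfare",
                {"origin": "Edmonton", "destination": "California"})
print(ET.tostring(session, encoding="unicode"))
```

The returned `task` element plays the role of the task DOM: it carries the user's inputs and an empty output node.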

The mediator uses the resulting combinations as problem specifications with 
which to query ITN. It sends the task DOM to the ITN wrapper. The task DOM 
contains the user input data and leaves the output elements to be filled. The 
ITN wrapper maps the task DOM into the parameters of the ITN method, and
when it receives the method’s response, it extracts from it the data of interest
and translates it into XML. The wrapper finally returns the task DOM to the
mediator. 

After the entire task is accomplished, i.e. the DOM has been updated by 
all the relevant wrappers, the mediation layer sends the result DOM back to 
the interface layer. For exploratory search tasks, the result DOM is essentially 
a table of possible solutions to the user’s input problem. To enable the user 
to better inspect the result, we have developed an XSL stylesheet that enables 
the user to sort and view the result in different ways. Thus, in our particular 
scenario, a table of the found tickets is presented to the user, who can sort it 
according to different ticket attributes. For example, the user may decide to see 
tickets in increasing price order and to select the first one, i.e., the cheapest. Or 
alternatively, the user may order the tickets according to departure dates and 
select some ticket toward the middle of July. Finally, the user might choose to 
completely abandon “California” as a destination and go to “Vancouver Island” 
instead, in which case the mediator will repeat the search process from the start. 

4 Reflections and Conclusions

Let us now reflect upon the generality/extensibility of this approach and the 
cost involved in its adoption in novel application domains. Our travel-planning 
assistant currently “talks” with only two wrappers and offers limited solution 
post-processing, i.e., inference of some derived flight attributes and flexible result
sorting. Extending it with further capabilities would simply involve editing
its task-structure XML document to include new tasks and implementing
Java methods for these tasks. The task-structure interpretation mechanism
would not need to be modified. Similarly, additional wrappers could simply be 
integrated by additional entries in the registry table. 

What is involved, however, in developing a similar exploratory search media- 
tor in another application domain? First the domain has to be modeled and the 
stylesheets of the new domain elements have to be developed. This is currently a
manual process, but various business and technology sectors have embarked on 
the specification of standard XML schemas for exchanging information in their 
domains of interest. These efforts would potentially eliminate this task in the 
future, for domains that have been thus standardized. The search task struc- 
ture would be mostly transferable with the exception of many of the domain- 
dependent post-processing subtasks. Some of the post-processing subtasks, such 
as sorting for example, are domain-independent and transferable. Finally, the 
sources would have to be wrapped. This is currently a manual process in our
environment, but we are investigating related work on automating it.

Finally, we want to close with a few concluding remarks. In this paper, we 
discussed an architecture for integration of WEB applications, based on a multi- 
agent system consisting of a mediator and a set of wrappers. The wrapper agents 
act as “representatives” of the functionalities that their underlying applications 
can deliver. The mediator’s internal process is defined in terms of a task struc- 
ture. The mediator’s task structure drives the interaction with its users and its 
communication with its associated wrappers. In particular, we have developed 
a generic exploratory-search task structure that allows the user to specify the 
problem at an abstract level and does not force early commitments on the de- 
sired attributes of the output. Such search mediators can support the users in 
exploring large areas of the problem space, and enable them to make decisions 
based on the results collected from the exploration. This search task structure 
can be instantiated in different domains, as long as objects in these domains can 
take values in spaces organized in an abstraction hierarchy. 

Both the mediator’s task structure and domain model are specified in XML. 
Thus task structures as well as domain models can be shared among mediators. 
In addition to their XML specification, the domain and task structure also have
associated XSL stylesheets which specify the interaction (elements, presentation,
layout and navigation) between the mediator and its users. The resulting inter- 
face is consistent and intuitive, since it directly corresponds to the task structure 
and the domain model. 
Abstract. We develop two specific operators for modifying a knowledge
base: Amend updates a knowledge base with a formula, while the
complementary operator Forget removes a formula from the knowledge
base. The approach differs from other belief change operators in that the
definition of the operators is compositional with respect to the sentence
added or removed. For example, Forget applied to a conjunction is defined
to be the union of Forget applied to the individual conjuncts. In some
cases the resulting approach has better complexity characteristics than 
other operators; however, this is at the expense of irrelevance of syntax 
and other of the Katsuno-Mendelzon update postulates. Alternately, we 
achieve full irrelevance of syntax if the sentence for update is replaced 
by the disjunction of its prime implicants and conversely the sentence 
for removal is replaced by the conjunction of its prime implicates. 



1 Introduction 

A knowledge base, consisting of a set of facts, assertions, beliefs, etc. is a central 
component of a declarative AI system. Such a knowledge base will evolve over 
time, with the addition of new information and the deletion of old or out-of-date 
information. A fundamental question concerns how such change should be ef- 
fected. One major body of research has addressed this question via the proposal 
of various rationality postulates, or rules that any adequate change operator 
should be expected to satisfy. These postulates describe belief change at the 
knowledge level, that is on an abstract level, independent of how beliefs are
represented and manipulated. In the AGM approach of Alchourrón, Gärdenfors, and
Makinson [AGM85, Gar88], standards for revision and contraction functions are
given, wherein it is assumed that a knowledge base is receiving information
concerning a static domain. Subsequently, Katsuno and Mendelzon [KM92] explored
a distinct notion of belief change, with functions for belief update and erasure,
wherein an agent changes its beliefs in response to changes in the environment.
See [KM92] for a comparison between revision and update. Various researchers,
including [Bor85, Dal88, For89, Sat88, Web86, Win88], have proposed specific
change operators.

In this paper we also propose specific update and erasure operators; however, 
our point of departure from previous work is that we develop operators intended 
to be compositional in the sentence $\phi$ representing information to be added or
removed. For example, if a conjunction $\phi = a \land b$ is to be erased from a knowledge
base $\psi$, the intuition is that this can be effected by the erasure of $a$ or $b$; the
result of the erase function in this case then would be the union of the erasure
of $a$ and, separately, $b$.

Arguably the approach yields plausible update and erase functions. However, 
we do not obtain the full set of update and erase postulates of [KM92]. In
particular, we lose update postulates U2, U4 and U6, and erase postulates E2 and
E4. We suggest means by which the full set of update and erase postulates can
be obtained. For example, we can recover the irrelevance of syntax postulate for
erasure (E4) by replacing a formula for erasure by its prime implicates. In return
for this “syntactic” aspect of the operators, we obtain improved complexity 
results in some cases compared to other operators. 

The next section reviews belief update and erasure, and describes the Winslett 
approach to update. Following this we describe our approach and in the next 
section present a discussion and analysis. Proofs of theorems not given here may
be found in [SD00].

2 Background 

2.1 Belief Update and Erasure 

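A belief set is defined as a set $K$ of sentences in some language $L$ which satisfies
the constraint: if $K$ logically entails $\beta$ then $\beta \in K$. $T$ is the set of belief sets. A
formula is said to be complete just if it implies the truth or falsity of every other
formula. Thus a complete, consistent formula provides a syntactic equivalent to
an interpretation.

In the approach of [KM92], an update function $\diamond$ is a function from $T \times L$ to
$T$ satisfying the following postulates.

(U1) $\psi \diamond \mu \vdash \mu$.

(U2) If $\psi \vdash \mu$ then $\psi \diamond \mu \equiv \psi$.

(U3) If both $\psi$ and $\mu$ are satisfiable then $\psi \diamond \mu$ is satisfiable.

(U4) If $\psi_1 \equiv \psi_2$ and $\mu_1 \equiv \mu_2$ then $\psi_1 \diamond \mu_1 \equiv \psi_2 \diamond \mu_2$.

(U5) $(\psi \diamond \mu) \land \phi$ implies $\psi \diamond (\mu \land \phi)$.

(U6) If $\psi \diamond \mu_1 \vdash \mu_2$ and $\psi \diamond \mu_2 \vdash \mu_1$ then $\psi \diamond \mu_1 \equiv \psi \diamond \mu_2$.

(U7) If $\psi$ is complete then $(\psi \diamond \mu_1) \land (\psi \diamond \mu_2)$ implies $\psi \diamond (\mu_1 \lor \mu_2)$.

(U8) $(\psi_1 \lor \psi_2) \diamond \mu \equiv (\psi_1 \diamond \mu) \lor (\psi_2 \diamond \mu)$.

An erasure function $\blacksquare$ is a function from $T \times L$ to $T$ satisfying the following
postulates.

(E1) $\psi \vdash \psi \,\blacksquare\, \mu$.

(E2) If $\psi \vdash \lnot\mu$ then $\psi \,\blacksquare\, \mu \equiv \psi$.

(E3) If $\psi$ is satisfiable and $\mu$ is not a tautology then $\psi \,\blacksquare\, \mu \nvdash \mu$.

(E4) If $\psi_1 \equiv \psi_2$ and $\mu_1 \equiv \mu_2$ then $\psi_1 \,\blacksquare\, \mu_1 \equiv \psi_2 \,\blacksquare\, \mu_2$.

(E5) $(\psi \,\blacksquare\, \mu) \land \mu \vdash \psi$.

(E8) $(\psi_1 \lor \psi_2) \,\blacksquare\, \mu$ is equivalent to $(\psi_1 \,\blacksquare\, \mu) \lor (\psi_2 \,\blacksquare\, \mu)$.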
For a given update operator $\diamond$, an erasure operator $\blacksquare$ can be defined by:

$$\psi \,\blacksquare\, \mu \equiv \psi \lor (\psi \diamond \lnot\mu) \tag{1}$$

For a given erasure operator $\blacksquare$, an update operator $\diamond$ can be defined by:

$$\psi \diamond \mu \equiv (\psi \,\blacksquare\, \lnot\mu) \land \mu \tag{2}$$

We obtain:

Theorem 1 ([KM92]).

1. If an update operator $\diamond$ satisfies U1-U4 and U8 then the erasure operator
$\blacksquare$ defined by Equation (1) satisfies E1-E5 and E8.

2. If an erasure operator $\blacksquare$ satisfies E1-E5 and E8 then the update operator
defined by Equation (2) satisfies U1-U4 and U8.

2.2 The Possible Worlds Approach to Update 

Winslett’s Possible Models Approach (PMA) [Win88] is a specific example of
an update operator satisfying the update postulates. The PMA update operator
is denoted $\diamond_{pma}$. To calculate $\psi \diamond_{pma} \mu$, we have that, for each interpretation
$I$ of $\psi$, $\diamond_{pma}$ selects from the interpretations of $\mu$ those that are “closest” to $I$. The
update is determined by the set of these closest interpretations. The notion of
“closeness” between two interpretations $I$ and $J$ is defined as follows:

$\mathit{diff}(I, J)$ = the set of all propositional letters on which $I$ and $J$ differ.

Interpretation $J_1$ is closer to $I$ than $J_2$, expressed $J_1 \leq_{I,pma} J_2$, just if
$\mathit{diff}(I, J_1) \subseteq \mathit{diff}(I, J_2)$. (If we measure closeness by the cardinality of the sets
$\mathit{diff}(I, J)$ we obtain the approach of [For89].) The $\leq_{I,pma}$-minimal set with respect to $\mu$
is designated $\mathit{Incorporate}(\mathit{Mod}(\mu), I)$.

From this we can specify the PMA update operator:

$$\mathit{Mod}(\psi \diamond_{pma} \mu) = \bigcup_{I \in \mathit{Mod}(\psi)} \mathit{Incorporate}(\mathit{Mod}(\mu), I).$$

The following example (from [KM92]) illustrates the PMA update function.
Let our language have just two propositional letters: $L = \{b, m\}$. Let $\psi$ be
equivalent to $(b \land \lnot m) \lor (\lnot b \land m)$. For concreteness, take $b$ to mean “the book is
on the floor”, and $m$ to mean “the magazine is on the floor”. So $\psi$ means that
either the book or the magazine is on the floor, but not both. Consider updating
with $\mu$ as $b$ (say, a robot is ordered to put the book on the floor). Intuitively,
at the end of this action the book will be on the floor, and the location of the
magazine will be unknown.

Under the PMA approach we have the following. The interpretations of $\psi$
are $I_1 = (\lnot b, m)$ and $I_2 = (b, \lnot m)$, and the interpretations of $\mu$ are $J_1 = (b, m)$ and
$J_2 = (b, \lnot m)$. We have $\mathit{diff}(I_1, J_1) = \{b\}$ and $\mathit{diff}(I_1, J_2) = \{b, m\}$, hence $J_1 \leq_{I_1,pma} J_2$,
and so $\mathit{Incorporate}(\mathit{Mod}(\mu), I_1) = \{J_1\}$. Similarly, $\mathit{Incorporate}(\mathit{Mod}(\mu), I_2) = \{J_2\}$.
Hence, $\psi \diamond_{pma} \mu \equiv b$, as desired.
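
The book-and-magazine computation can be reproduced mechanically. The following Python sketch is an illustration under stated assumptions rather than Winslett's formulation verbatim: an interpretation is represented as the frozenset of letters it makes true, so that $\mathit{diff}$ becomes symmetric difference, and all function names are chosen here.

```python
from itertools import product


def interpretations(letters):
    """All truth assignments over the letters, as frozensets of the true letters."""
    return [frozenset(l for l, v in zip(letters, values) if v)
            for values in product([False, True], repeat=len(letters))]


def diff(i, j):
    """Letters on which interpretations i and j differ."""
    return i ^ j


def incorporate(models_mu, i):
    """Models of mu whose difference from i is minimal under set inclusion."""
    return {j for j in models_mu
            if not any(diff(i, k) < diff(i, j) for k in models_mu)}


def pma_update(models_psi, models_mu):
    """Union, over the models of psi, of the closest models of mu."""
    result = set()
    for i in models_psi:
        result |= incorporate(models_mu, i)
    return result


worlds = interpretations(["b", "m"])
psi = {w for w in worlds if ("b" in w) != ("m" in w)}  # exactly one item on the floor
mu = {w for w in worlds if "b" in w}                   # the book is on the floor
print(pma_update(psi, mu) == mu)  # prints True: the result is equivalent to b
```

Replacing the strict-subset test `<` with a comparison of `len(diff(...))` would give the cardinality-based notion of closeness mentioned above.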
3 The Approach

3.1 Preliminaries 

The underlying logic will be classical propositional logic. We consider a
propositional language $L$, over a finite set of atoms, or propositional letters,
$P = \{a, b, c, \ldots\}$, and truth-functional connectives $\lnot$, $\land$, $\lor$, $\supset$, and $\equiv$. $\mathit{Lits}$ is the
set of literals: $\mathit{Lits} = P \cup \{\lnot l \mid l \in P\}$. An interpretation of $L$ is a function
from $P$ to $\{T, F\}$. A model of a sentence $\alpha$ is an interpretation that makes $\alpha$
true, according to the usual definition of truth. A model can be equated with
its defining set of literals. $\mathit{Mod}(\alpha)$ denotes the set of models of sentence $\alpha$. We
will equate a knowledge base $\psi$ with the set of its models, or equivalently, with
the set of sets of literals each of which defines a model of $\psi$. $\mathcal{K}$ will be the set
of knowledge bases. For interpretation $\omega$ we write $\omega \models \alpha$ for $\alpha$ is true in $\omega$.
For interpretation $\omega$ and literal $l$, we write $\omega \setminus l$ to denote the set of literals in
$\omega$ but with the literal mentioning the atomic sentence in $l$ removed. Thus for
$L = \{a, b\}$ and $\omega = \{a, \lnot b\}$ we have $\omega \setminus b = \{a\}$.

3.2 Intuitions 

Two operators are defined for changing the state of a propositional knowledge 
base, one for update and the other for erasure. These operators are defined in a 
compositional fashion so that, for updating or erasing formula $\phi$, the operator is
defined in terms of the components of $\phi$. We begin by considering how this can
be effected for an erasure operator, called here Forget.

For the base case, suppose $\phi = l$ is a literal, and we wish to remove $l$ from
knowledge base $\psi$. Clearly, simply adding $\lnot l$ (however “add” is defined) is too
strong. Instead, we want to change the knowledge base only enough so that it
does not entail $l$. We can do this by adding, for each model $\omega \in \psi$ such that
$\omega \models l$, an interpretation $\omega' = (\omega \setminus l) \cup \{\lnot l\}$ to the knowledge base. Thus, we
would have both $\omega'$ and $\omega$ in the resulting knowledge base. The intuition is that
for each model, or possible state of affairs, $\omega$ where $l$ is true, we add a model $\omega'$
exactly like $\omega$, except that $\omega' \models \lnot l$.

Next consider removing a conjunction of literals $l_1 \land l_2$ from a knowledge
base. In this case, minimal change means allowing the knowledge base to entail
$l_1 \lor l_2$ at the most. Intuitively, we want the result of Forget of $l_1 \land l_2$ to be the
union of Forget of $l_1$ and Forget of $l_2$. We can do this by adding, for each $\omega \in \psi$
such that $\omega \models l_1$, an interpretation $\omega' = (\omega \setminus l_1) \cup \{\lnot l_1\}$ and, for each $\omega \in \psi$
such that $\omega \models l_2$, an interpretation $\omega' = (\omega \setminus l_2) \cup \{\lnot l_2\}$.

Finally, to remove a disjunction of literals $l_1 \lor l_2$ from a knowledge base
we want to add “sufficient” interpretations in which neither $l_1$ nor $l_2$ is true.
We can accomplish this by adding, for each $\omega \in \psi$ such that $\omega \models l_1 \lor l_2$, an
interpretation $\omega' = ((\omega \setminus l_1) \setminus l_2) \cup \{\lnot l_1, \lnot l_2\}$.

We generalize Forget to handle arbitrary propositional formulas by gener- 
alizing how it handles conjunctions and disjunctions. So for conjunctions, we 
extend the definition to conjunctions of formulas rather than just literals. So to 
Forget a conjunction of formulas, we recursively Forget each formula and then
take the union of the results. For disjunctions, we similarly extend the definition 
of the operator to handle arbitrary disjunctions. 

3.3 Update and Erasure Operators 

Based on these intuitions, we define update and erasure operators, Amend and
Forget respectively. We begin with some preliminary definitions.

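Definition 1. For interpretation $\omega$ and $\Gamma \subseteq L$, define $\mathit{Flip}(\omega, \Gamma)$ as follows:

1. If $\Gamma \subseteq \mathit{Lits}$ then

   if $\exists\, l, \lnot l \in \Gamma$ then $\mathit{Flip}(\omega, \Gamma) = \{\}$; otherwise,

   $\mathit{Flip}(\omega, \Gamma) = \{(\omega \setminus \Gamma) \cup \{l \mid \lnot l \in \Gamma \text{ and } l \in P\} \cup \{\lnot l \mid l \in \Gamma \cap P\}\}$

2. If $\Gamma = \{\alpha \land \beta\} \cup \Gamma'$ then $\mathit{Flip}(\omega, \Gamma) = \mathit{Flip}(\omega, \{\alpha\} \cup \Gamma') \cup \mathit{Flip}(\omega, \{\beta\} \cup \Gamma')$

3. If $\Gamma = \{\alpha \lor \beta\} \cup \Gamma'$ then $\mathit{Flip}(\omega, \Gamma) = \mathit{Flip}(\omega, \{\alpha, \beta\} \cup \Gamma')$

4. If $\Gamma = \{\lnot(\alpha \lor \beta)\} \cup \Gamma'$ then $\mathit{Flip}(\omega, \Gamma) = \mathit{Flip}(\omega, \{\lnot\alpha \land \lnot\beta\} \cup \Gamma')$

5. If $\Gamma = \{\lnot(\alpha \land \beta)\} \cup \Gamma'$ then $\mathit{Flip}(\omega, \Gamma) = \mathit{Flip}(\omega, \{\lnot\alpha \lor \lnot\beta\} \cup \Gamma')$

6. If $\Gamma = \{\lnot\lnot\alpha\} \cup \Gamma'$ then $\mathit{Flip}(\omega, \Gamma) = \mathit{Flip}(\omega, \{\alpha\} \cup \Gamma')$

For knowledge base $\psi$, $\mathit{New}(\psi, \mu)$ is the set of new interpretations added to $\psi$
necessary to Forget the formula $\mu$.

Definition 2. $\mathit{New}(\psi, \mu) = \{\omega' \mid \omega' \in \mathit{Flip}(\omega, \{\mu\}) \text{ where } \omega \in \psi\}$.

Definition 3. $\mathit{Forget}(\psi, \mu) = \psi \cup \mathit{New}(\psi, \mu)$.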
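
Definitions 1 to 3 translate almost directly into code. The Python sketch below is illustrative only and rests on assumptions of its own: an interpretation is the frozenset of atoms it makes true, a formula is an atom name or a nested tuple `("not", f)`, `("and", f, g)` or `("or", f, g)`, and only these three connectives are handled. The function names mirror the definitions but the representation is chosen here.

```python
def is_literal(f):
    """An atom (a string) or a negated atom."""
    return isinstance(f, str) or (f[0] == "not" and isinstance(f[1], str))


def flip(omega, gamma):
    """Flip(omega, gamma) of Definition 1; gamma is a frozenset of formulas."""
    for f in gamma:
        if is_literal(f):
            continue
        rest = gamma - {f}
        if f[0] == "and":  # case 2: one branch per conjunct
            return flip(omega, rest | {f[1]}) | flip(omega, rest | {f[2]})
        if f[0] == "or":   # case 3: both disjuncts are handled together
            return flip(omega, rest | {f[1], f[2]})
        g = f[1]           # f is the negation of a compound formula g
        if g[0] == "or":   # case 4
            return flip(omega, rest | {("and", ("not", g[1]), ("not", g[2]))})
        if g[0] == "and":  # case 5
            return flip(omega, rest | {("or", ("not", g[1]), ("not", g[2]))})
        return flip(omega, rest | {g[1]})  # case 6: double negation
    # case 1: gamma contains only literals, each of which is made false
    positive = {f for f in gamma if isinstance(f, str)}
    negative = {f[1] for f in gamma if not isinstance(f, str)}
    if positive & negative:  # gamma contains both l and not-l
        return set()
    return {frozenset((omega - positive) | negative)}


def new(psi, mu):
    """New(psi, mu) of Definition 2: all flipped versions of the models of psi."""
    result = set()
    for omega in psi:
        result |= flip(omega, frozenset({mu}))
    return result


def forget(psi, mu):
    """Forget(psi, mu) of Definition 3."""
    return set(psi) | new(psi, mu)
```

For a knowledge base with the single model in which $a$ and $b$ are both true, `forget(psi, "a")` adds the model where only $b$ holds, and forgetting `("and", "a", "b")` yields the union of forgetting `"a"` and forgetting `"b"`, matching the intuition given for conjunctions.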

From the definition of Forget and Equation (2) we can define two update operators
based on Forget, called Amend and Amend′:

$$\mathit{Amend}(\psi, \mu) = \mathit{New}(\psi, \lnot\mu) \tag{3}$$

$$\mathit{Amend}'(\psi, \mu) = \mathit{Forget}(\psi, \lnot\mu) \cap \mathit{Mod}(\mu) \tag{4}$$

We obtain:

Theorem 2. For $\psi \in \mathcal{K}$ and $\mu \in L$, we have $\mathit{Amend}(\psi, \mu) = \mathit{Amend}'(\psi, \mu)$.

4 Properties of the Operators

To start, we determine which of the Katsuno-Mendelzon postulates our
definitions of Forget and Amend satisfy.

Theorem 3. Forget satisfies E1, E3, E5 and E8 but not E2 and E4.

Intuitively, E2 is not satisfied because Forget may add additional interpretations
not in $\psi$ that entail $\lnot\mu$. Therefore, if $\psi$ already contains all interpretations
that entail $\lnot\mu$ then the knowledge base will not change. See below for a
counterexample to E4 and Section 4.1 for a counterexample to E2.

The postulates that Amend fails to satisfy are directly related to those that
Forget fails to satisfy.
Theorem 4. Amend satisfies U1, U3, U5, U7 and U8 but not U2, U4 and U6.

It does not satisfy U2 and U6 for exactly the same reason that Forget does
not satisfy E2; that is, $\mathit{Amend}(\psi, \mu)$ may add additional interpretations not in
$\psi$ that entail $\mu$. We pursue this behaviour further in the next section, where we
use this as a point of contrast with Winslett’s approach. Similarly, U4 does not
hold for the same reason that E4 doesn’t hold.

Despite failing to satisfy some postulates, Forget and Amend do exhibit a
nice property that operators satisfying all the update or erasure postulates fail
to satisfy. In the following, let $\psi$ be a knowledge base and $\mu$, $\phi$ and $\mu \land \phi$ be
satisfiable propositional formulas.

Theorem 5. $\mathit{Forget}(\psi, \mu \land \phi) = \mathit{Forget}(\psi, \mu) \cup \mathit{Forget}(\psi, \phi)$

Theorem 6. $\mathit{Amend}(\psi, \mu \lor \phi) = \mathit{Amend}(\psi, \mu) \cup \mathit{Amend}(\psi, \phi)$

The reason that E4 and U4 are not satisfied is that in our compositional
approach, separate parts of a formula may “interact” to provide implicit results
not explicit in the formula. For example, consider $(\lnot a \lor b) \land (\lnot b \lor c)$. Updating
a knowledge base by this formula is effected by updating by the individual
components, viz. $(\lnot a \lor b)$ and $(\lnot b \lor c)$. However, implicit in these parts is the
fact that $(\lnot a \lor c)$ is also true, and the presence of this formula would affect the
result of the update. It would seem that if we could “compile out” all the implicit
information in a formula then we would obtain substitution of equivalents,
as expressed in E4 and U4. One way to satisfy E4 and U4 then is to redefine
Amend and Forget so that we first “compile out” implicit information. We do
this by defining operators that consider the prime implicates of a formula. We
call these modified operators Forget-PI and Amend-PI. For $\mu \in L$, let $\mathit{PI}(\mu)$
be the set of prime implicates of $\mu$.

Definition 4.

$$\mathit{Forget\text{-}PI}(\psi, \mu) = \mathit{Forget}(\psi, \mathit{PI}(\mu))$$

$$\mathit{Amend\text{-}PI}(\psi, \mu) = \mathit{New}(\psi, \mathit{PI}(\lnot\mu))$$

Surprisingly, although we gain E4 and U4, we lose U7 with Amend-PI. A
counter-example for U7 consists of formulas $\mu_1 = (a \land d) \lor (\lnot c \land d)$ and
$\mu_2 = (\lnot a \land d) \lor (\lnot c \land d)$ and a complete knowledge base $\psi$ that entails $a \land b \land c \land d$.
By definition,

$$\mathit{Amend\text{-}PI}(\psi, \mu_1) \cap \mathit{Amend\text{-}PI}(\psi, \mu_2) = \mathit{New}(\psi, (\lnot a \lor \lnot d) \land (c \lor \lnot d)) \cap \mathit{New}(\psi, (a \lor \lnot d) \land (c \lor \lnot d))$$

which entails $a \land b \land \lnot c \land d$. On the other hand, $\mathit{Amend\text{-}PI}(\psi, \mu_1 \lor \mu_2) =
\mathit{New}(\psi, \lnot d)$, which is equal to $\psi$. Notice that the prime implicates of $\lnot\mu_1$ and $\lnot\mu_2$
retain the clause $(c \lor \lnot d)$ whereas the only prime implicate of $\lnot(\mu_1 \lor \mu_2)$ is $\lnot d$. As
a result, both $\mathit{Amend\text{-}PI}(\psi, \mu_1)$ and $\mathit{Amend\text{-}PI}(\psi, \mu_2)$ contain an interpretation
$\omega'$ just like $\omega \in \psi$ except that $c$ is negated, whereas $\mathit{Amend\text{-}PI}(\psi, \mu_1 \lor \mu_2)$
contains only the interpretations in $\psi$. Thus, $\mathit{Amend\text{-}PI}(\psi, \mu_1) \cap \mathit{Amend\text{-}PI}(\psi, \mu_2)$
does not imply $\mathit{Amend\text{-}PI}(\psi, \mu_1 \lor \mu_2)$.

Finally, E2 and U2 can also be satisfied by adding a condition to Amend and
Forget. Essentially, E2 states that if we wish to remove $\mu$ and the knowledge base
already implies $\lnot\mu$ then don’t bother changing the knowledge base. Similarly,
U2 states that if we wish to update by $\mu$ and the knowledge base already implies
$\mu$, then don’t bother changing the knowledge base. Call these modified operators
Forget′ and Amend′. We obtain:

Definition 5.

$$\mathit{Forget}'(\psi, \mu) = \begin{cases} \psi & \text{if } \psi \vdash \lnot\mu \\ \mathit{Forget}(\psi, \mu) & \text{otherwise} \end{cases}$$

$$\mathit{Amend}'(\psi, \mu) = \begin{cases} \psi & \text{if } \psi \vdash \mu \\ \mathit{Amend}(\psi, \mu) & \text{otherwise} \end{cases}$$



4.1 Comparison with Related Work 

In this section we compare Amend to Winslett’s Possible Models Approach in
[Win88]. Winslett’s Possible Models Approach exhibits several attractive properties.
First, it satisfies U1-U8. Second, it operates well in applications involving
reasoning about action. Third, it supports protected formulas in a knowledge 
base, where a protected formula is one that must remain true before and after 
knowledge change. 

Clearly, Winslett’s operator is not equivalent to Amend, since Winslett’s 
operator satisfies all of the update postulates whereas Amend does not. The 
following example demonstrates this point and in fact exploits the fact that 
Winslett’s operator satisfies U2 but Amend does not. Consider the example of 
the book and magazine from earlier, where $\psi = (b \land \lnot m) \lor (\lnot b \land m)$. However,
this time the formula for updating is $b \lor m$. In our approach, the models of
the update formula are $M_1 = b \land m$, $M_2 = b \land \lnot m$ and $M_3 = \lnot b \land m$. Hence
$\mathit{Amend}(\psi, b \lor m) = \mathit{Mod}(b \lor m)$. Winslett’s operator, by contrast, returns
the original knowledge base $\psi$, in accordance with U2. Amend instead adds an
additional model. This appears to make some sense because by
updating by $b \lor m$, we are really telling the knowledge base that the world has
changed so that one of $b \land m$ or $b \land \lnot m$ or $\lnot b \land m$ is true.
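
The value of Amend on this example can be checked by unfolding the definition of Flip by hand. The derivation below is a sketch that writes each interpretation as its set of literals, with $\omega_1$ and $\omega_2$ naming the two models of the knowledge base (these names are introduced here for convenience).

```latex
\begin{aligned}
\mathit{Amend}(\psi, b \lor m) &= \mathit{New}(\psi, \lnot(b \lor m)) \\
\mathit{Flip}(\omega, \{\lnot(b \lor m)\})
  &= \mathit{Flip}(\omega, \{\lnot b \land \lnot m\})
   = \mathit{Flip}(\omega, \{\lnot b\}) \cup \mathit{Flip}(\omega, \{\lnot m\}) \\
\omega_1 = \{b, \lnot m\} &: \quad \{\{b, \lnot m\}\} \cup \{\{b, m\}\} \\
\omega_2 = \{\lnot b, m\} &: \quad \{\{b, m\}\} \cup \{\{\lnot b, m\}\} \\
\mathit{Amend}(\psi, b \lor m)
  &= \{\{b, \lnot m\},\ \{b, m\},\ \{\lnot b, m\}\} = \mathit{Mod}(b \lor m)
\end{aligned}
```

The extra model $\{b, m\}$ arises because each conjunct of $\lnot b \land \lnot m$ is flipped separately in every model, so both $\omega_1$ and $\omega_2$ contribute it.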

Winslett’s approach supports protected formulas. By slightly modifying the 
definitions of Forget and Amend, it is a straightforward matter to have these 
operators also support protected formulas. One way is to redefine $\mathit{Flip}(\omega, \Gamma)$ to
replace every interpretation normally in $\mathit{Flip}(\omega, \Gamma)$ that contradicts a protected
formula with the interpretation $\omega$. In other words, we simply ignore
interpretations that contradict the protected formulas.



4.2 Algorithms and Complexity 

In this section we provide algorithms for our operators. We also analyze the 
complexity of these algorithms under a variety of assumptions. Specifically, we 
analyze the complexity of the algorithms when applied to any general propositional
formulas, any formulas in conjunctive normal form, any formulas in
disjunctive normal form, and any formulas whose sizes are below some specified 
constant. 

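In the following algorithms defined for Forget, Amend and New, we have
$\psi, \mu \in L$ and $\Gamma \subseteq L$:

Algorithm Forget-Alg($\psi$, $\mu$)

1. return $\psi \lor$ New-Alg($\psi$, $\mu$)

Algorithm Amend-Alg($\psi$, $\mu$)

1. return New-Alg($\psi$, $\lnot\mu$)

Algorithm New-Alg($\psi$, $\mu$)

1. $\psi' \leftarrow$ False
2. for $\Gamma \in$ Flip-Sets($\{\mu\}$)
3.     $\psi'' \leftarrow$ NNF($\psi$)
4.     for $l \in \Gamma$
5.         if $l \in P$
6.             replace each occurrence of $\lnot l$ in $\psi''$ with $l$
7.         else
8.             let $l' \in P$ be such that $\lnot l' = l$
9.             replace each $l'$ in $\psi''$ not preceded by “$\lnot$” with $l$
10.        end if
11.    end for
12.    $\psi' \leftarrow \psi' \lor (\psi'' \land \Gamma)$
13. end for
14. return $\psi'$

Algorithm NNF($\mu$)

1. if $\mu \in \mathit{Lits}$ return $\mu$
2. else if $\mu = \alpha \land \beta$ return NNF($\alpha$) $\land$ NNF($\beta$)
3. else if $\mu = \alpha \lor \beta$ return NNF($\alpha$) $\lor$ NNF($\beta$)
4. else if $\mu = \lnot(\alpha \lor \beta)$ return NNF($\lnot\alpha$) $\land$ NNF($\lnot\beta$)
5. else if $\mu = \lnot(\alpha \land \beta)$ return NNF($\lnot\alpha$) $\lor$ NNF($\lnot\beta$)
6. else if $\mu = \lnot\lnot\alpha$ return NNF($\alpha$)

Algorithm Flip-Sets($\Gamma$)

1. if $\Gamma \subseteq \mathit{Lits}$
2.     if $\exists\, l, \lnot l \in \Gamma$ return $\{\}$
3.     else return $\{\{l \mid \lnot l \in \Gamma \text{ and } l \in P\} \cup \{\lnot l \mid l \in \Gamma \cap P\}\}$
4. else if $\Gamma = \{\alpha \land \beta\} \cup \Gamma'$
       return Flip-Sets($\{\alpha\} \cup \Gamma'$) $\cup$ Flip-Sets($\{\beta\} \cup \Gamma'$)
5. else if $\Gamma = \{\alpha \lor \beta\} \cup \Gamma'$ return Flip-Sets($\{\alpha, \beta\} \cup \Gamma'$)
6. else if $\Gamma = \{\lnot(\alpha \lor \beta)\} \cup \Gamma'$ return Flip-Sets($\{\lnot\alpha \land \lnot\beta\} \cup \Gamma'$)
7. else if $\Gamma = \{\lnot(\alpha \land \beta)\} \cup \Gamma'$ return Flip-Sets($\{\lnot\alpha \lor \lnot\beta\} \cup \Gamma'$)
8. else if $\Gamma = \{\lnot\lnot\alpha\} \cup \Gamma'$ return Flip-Sets($\{\alpha\} \cup \Gamma'$)

Note that NNF just returns the negation normal form of a formula.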
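
As a concrete reading of the NNF procedure, here is a small Python sketch. It assumes formulas are atom names or nested tuples `("not", f)`, `("and", f, g)` and `("or", f, g)`; this representation is an assumption made for the example, and, like the algorithm above, it handles only negation, conjunction and disjunction.

```python
def nnf(f):
    """Return the negation normal form of f (formulas over not/and/or only)."""
    if isinstance(f, str):          # an atom is already a literal
        return f
    op = f[0]
    if op in ("and", "or"):         # steps 2 and 3: recurse into both sides
        return (op, nnf(f[1]), nnf(f[2]))
    g = f[1]                        # f is ("not", g)
    if isinstance(g, str):          # negated atom: also a literal (step 1)
        return f
    if g[0] == "or":                # step 4: De Morgan
        return ("and", nnf(("not", g[1])), nnf(("not", g[2])))
    if g[0] == "and":               # step 5: De Morgan
        return ("or", nnf(("not", g[1])), nnf(("not", g[2])))
    return nnf(g[1])                # step 6: double negation
```

For instance, `nnf(("not", ("or", "a", ("not", "b"))))` pushes the negation inward and removes the double negation on `b`.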
The following theorems state that these algorithms are complete and sound
with respect to the operators.

Theorem 7. $\mathit{New}(\mathit{Mod}(\psi), \mu) = \mathit{Mod}(\text{New-Alg}(\psi, \mu))$.

Corollary 1. $\mathit{Forget}(\mathit{Mod}(\psi), \mu) = \mathit{Mod}(\text{Forget-Alg}(\psi, \mu))$.

Corollary 2. $\mathit{Amend}(\mathit{Mod}(\psi), \mu) = \mathit{Mod}(\text{Amend-Alg}(\psi, \mu))$.

We now give the complexity of the algorithms. Let $\psi, \mu \in L$ and let $\|\delta\|$ be
the size of $\delta \in L$.

Theorem 8. The complexity of evaluating Forget-Alg($\psi$, $\mu$) is:

1. $O(\|\psi\| \, 2^{\|\mu\|})$ whenever $\mu$ is any general propositional formula,

2. $O(\|\psi\| \, \|\mu\|)$ whenever $\mu$ is in conjunctive normal form and

3. $O(\|\psi\|)$ whenever the size of $\mu$ is bounded by some constant.

Theorem 9. The complexity of evaluating Amend-Alg($\psi$, $\mu$) is:

1. $O(\|\psi\| \, 2^{\|\mu\|})$ whenever $\mu$ is any general propositional formula,

2. $O(\|\psi\| \, \|\mu\|)$ whenever $\mu$ is in disjunctive normal form and

3. $O(\|\psi\|)$ whenever the size of $\mu$ is bounded by some constant.

Both theorems show that, like most other model-based change operators,
the complexity of the algorithm is exponential in the size of the update formula [EG92].
Also, because of how Flip is defined, we see that Forget-Alg is quite efficient
when the formula to remove is in conjunctive normal form and Amend-Alg is
quite efficient when the formula to add is in disjunctive normal form. According
to [EG92] the major model-based operators are at least co-NP-Complete when
the update formulas are in either of these two forms. Since it is reasonable for 
practical knowledge base systems to put limits on the size or form of the update 
formulas, these are positive results. 

5 Conclusion 

We have presented two belief change operators concerned with updating a knowl- 
edge base. Amend updates a knowledge base with a formula, while Forget removes 
a formula from the knowledge base. The definition of the operators is intended 
to be compositional with respect to the sentence added or removed. As a result 
we lose some of the standard postulates for update and erasure. We achieve full 
irrelevance of syntax if the sentence for update is replaced by the disjunction of 
its prime implicants and conversely the sentence for removal is replaced by the 
conjunction of its prime implicates. The approach is illustrated with examples 
and through comparison with other approaches. We argue that the approach is
interesting first because it is founded on different intuitions than other operators,
in that it is based on compositional concerns, and second because it allows a
straightforward, and in some cases more efficient, implementation.

There are still unexplored questions about this approach. One is whether 
these operators may provide an appropriate model for updating independent 
“belief cells” (as given for example in the components of a formula), or as an 
operator combining syntactic and semantic characteristics. Another is whether 
our approach can be extended to full predicate calculus. A third question concerns
the behavior of Amend as an erasure operator. For example, let $\psi$ be the
knowledge base equivalent to $(a \vee b) \wedge (\neg a \vee \neg b)$. Then, if we use Amend to
add the formula $a \vee b$, which is already in $\psi$, the resulting knowledge base is
equivalent to $a \vee b$. Thus, in the process of using Amend to add a formula to the
knowledge base, we have actually just removed one. In general, Amend appears
to behave as an erasure operator when it is used to add disjunctions of formulas
already in the knowledge base.


Abstract. This work examines the relevance of uncertain temporal information.
A key observation that motivates the analysis presented here
is that in the presence of uncertainty, relevance of information degener- 
ates as time evolves. This paper presents an empirical quantitative study 
of the degeneration of relevance in time-sliced Belief Networks that aims 
at extending known results. A simple technique for estimating an up- 
per bound on the relevance time is presented. To validate the proposed 
technique, results of experiments using realistic and synthetic time-sliced 
belief networks are presented. The results show that the proposed upper 
bound holds in more than 98% of the experiments. These results have 
been obtained using a modified version of the dynamic belief networks 
roll-up algorithm. 



1 Introduction 

The success of intelligent systems in performing useful tasks depends to a large 
extent on the availability of relevant knowledge. A paradox arises in many prac- 
tical systems when the amount of knowledge grows beyond the capacity of these 
systems. Too much knowledge usually slows down the system due to the compu- 
tational complexity of reasoning. The slower performance threatens the practi- 
cality of intelligent systems and their success. Hence, it is desirable to have the 
knowledge base as concise as possible. Nevertheless, coverage of different situ- 
ations and exceptions is an equally desirable feature. The study of information 
relevance tries to determine what information is relevant to a particular task 
and what can be ignored without compromising the conclusions. For example, 
in a query answering task, the subset of the theory relevant to a particular query 
can be defined as follows. 

Definition. Given a theory $\Theta$ consisting of a set of assertions (e.g., sentences
of logic) and a query $Q$ consisting of a conjunction $Q = (q_1 \wedge q_2 \wedge \dots \wedge q_n)$, the
relevant theory $\Theta_Q$ is a minimal subset of $\Theta$ such that, for all $q_i \in Q$, $q_i$ follows from
$\Theta_Q$ iff $q_i$ also follows from $\Theta$.

The above definition does not immediately extend to probabilistic queries
since there is no straightforward translation that maps a logical implication to
a conditional probability. For the present discussion, assume that all the $q_i$'s in
the query $Q$ are binary random variables, and consider the theory $\Theta$ as a set of
assertions describing a joint probability distribution over all variables.^1

A probabilistic definition of irrelevance would require that the probability of
the query remains unchanged in the full theory and the abridged one,

$$P(Q \mid \Theta) = P(Q \mid \Theta_Q).$$

Moreover, this equality holds for all outcomes in $Q$ (given $\Theta$ and $\Theta_Q$).

The objective here is to study irrelevance in temporal probabilistic domains. 
These domains are dynamic with uncertain temporal evolution. The ability to 
forecast and explain changes in such domains is required for a wide range of 
applications including planning, diagnosis, natural language understanding and 
scheduling. Efficient performance of such tasks requires a theory of relevance in 
dynamic uncertain domains. In such domains, two qualitative notions of irrele- 
vance of information fit directly into the frame of irrelevance as independence. 
These two notions are unconditional independence and conditional independence. 
It is sufficient to find the dependent subnet, which can be done in polynomial time
using d-separability. A procedure for building multiple instantiation Bayesian
networks to answer a query, which starts by building evidence and query subnetworks,
is presented in [?]. In that work, a pruning phase eliminates independent or
conditionally independent portions of a large time-sliced belief network.

The common sense law of inertia assumes that a state persists indefinitely
[?]. Indefinite persistence is an idealization rarely useful in practical applications.
The ability to model the gradual decay of persistence over time has been part of
the appeal of time-sliced belief networks [3]. Here, we divide the time following
an observation into two periods: a relevance period during which the observation
continues to be relevant with respect to beliefs, and an irrelevance period during
which the observation becomes completely irrelevant.

However, persistence tells only part of the story because an observation is 
also relevant if it helps us predict a future state. For example, observing that 
the sun is shining outside now justifies predicting that it will be daylight at the 
same time tomorrow, even though the sunshine does not persist. Relevance is 
about predictability in general, and persistence is just one aspect that allows 
this predictability. 

The goal of this work is to predict the relevance time of an observation 
when time-sliced belief networks are used for representation and reasoning. The 
technical motivation behind this work is to achieve performance improvements 
by weeding out some irrelevant information. The approach taken here relies on 
the commonsense notion of what people consider relevant, and results from the 
Markov chain theory. 

- Commonsense relevance: When dealing with random phenomena, people
  generally consider observations relevant for limited durations. For example,
  observing the cat lying on the living room couch in the morning is irrelevant
  when trying to locate the cat a few hours later.

- Markov chain theory: Regular Markov chains converge to a stationary
  distribution which does not depend on the initial state. Theoretically, it takes
  infinite time to reach the stationary distribution, but an upper bound on the
  difference in probabilities due to the initial state is given by $(1 - 2\epsilon)^t$, where $\epsilon$
  is the smallest transition probability and $t$ is the time [?]. Recent work on
  convergence time shows that the convergence time T can be bounded by
  RT.A*, where A is the absolute value of the second highest among the
  eigenvalues of the transition matrix. The convergence of Markov chains exhibits
  cutoff behaviour [?]: after an initial period of seemingly little change, the
  probabilities converge quickly to the steady state.

^1 This can be done concisely using a Bayesian network.

Time-sliced Bayesian networks are the most common technique for probabilistic
temporal reasoning. These networks represent time discretely and create
an instance of each time-varying random variable for each point in time. The
arcs connecting two instantiations are sometimes called temporal arcs and are
responsible for propagating temporal effects such as causation and persistence
across time. There is no consensus as to when a new instantiation is needed.
DHugin networks add new instantiations at equal intervals [2]. Depending on
the chosen temporal resolution, the network may contain a number of instantiations
$n = T / \Delta$, where $T$ is the total duration and $\Delta$ is the temporal resolution.
The computational complexity of inference in Bayesian networks is NP-hard [?].
Reducing the size of the network invariably results in significant performance
improvements.

1.1 Weaker Relevance Criterion 

In dynamic uncertain domains, we identify a class of irrelevance due to the 
uncertain dynamic nature of change. The relevance in such domains degenerates 
with time. Unfortunately, the definition of relevance presented above does not 
capture this class of information irrelevance. For example, it is not possible 
to claim that observing the cat on the couch is independent from its present 
location. There is a slim chance that the cat remained on the couch all this time. 
The definition of relevance as probabilistic dependence would not allow relevance 
to fade away after a finite duration. A relevance duration based on the strict 
definition of conditional dependence will always be infinite. In order to bridge 
the gap between our common-sense notion of irrelevance and the definition, we 
define a weaker temporal relevance criterion. 

If the maximum change that an assertion $\theta_j \in \Theta$ at time $t_j$ can induce on
the probability of belief $q_i$ at time $t_i$ is less than a small value $\delta$, then $t_i$ and $t_j$
are temporally extraneous with respect to $q_i$.

Definition. For a binary variable (or a conjunction of binary variables) $\theta$ and
a belief $q_i$, the degree of relevance $\delta$ of $\theta$ at time $t_j$ with respect to $q_i$ at time $t_i$ can
be defined as the smallest $\delta$ that satisfies the inequality

$$|P(q_i \mid \theta_j) - P(q_i \mid \neg\theta_j)| < \delta.$$

The above inequality does not apply if $\neg\theta_j$ contains one or more disjunctions
or if the variables in $\theta$ are not binary. A problem known as the disjunctive factors
problem manifests itself each time $\neg\theta_j$ is a disjunction [?]. Suppose that we are
interested in evaluating the relevance of a certain dose of a new treatment $C$ to
the recovery $R$ from a certain disease. The doses are 0 mg, 250 mg, and 500 mg.
The probabilities $P(C = 0)$, $P(C = 250)$ and $P(C = 500)$ are all equal (each
is 1/3). It has been experimentally determined that the probability of
recovery depends on the dose as follows: $P(R \mid C = 0) = 0.2$, $P(R \mid C = 250) = 0.7$
and $P(R \mid C = 500) = 0.9$. To evaluate the relevance of the higher dose of 500 mg,
we compute the difference $P(R \mid C = 500) - P(R \mid \neg(C = 500))$. Here

$$P(R \mid \neg(C = 500)) = \frac{P(C = 0)\,P(R \mid C = 0) + P(C = 250)\,P(R \mid C = 250)}{P(C = 0) + P(C = 250)} = 0.45,$$

so the difference is $0.9 - 0.45 = 0.45$. This value indicates that the 500 mg dose is relevant
to recovery (i.e., a patient is more likely to recover taking this dose than randomly
choosing between the other two alternatives). The same conclusion can easily be reached by
pairwise comparison of the conditional probabilities. We modify the definition of
relevance to use pairwise comparisons as follows.

Definition. The degree of relevance of factor $\theta_j$ with respect to query $q_i$ is
$\delta$ iff for all possible assignments of $\theta_j$ the maximum change in the probability
$P(q_i \mid \theta_j)$ is less than $\delta$.
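
The dose example and the pairwise definition can be checked numerically. The sketch below uses only the probabilities quoted in the text; the function names (`recovery_given_not`, `complement_relevance`, `pairwise_relevance`) are illustrative and not part of the original work.

```python
# Numbers taken from the dose example in the text.
prior = {0: 1 / 3, 250: 1 / 3, 500: 1 / 3}   # P(C = dose)
recovery = {0: 0.2, 250: 0.7, 500: 0.9}      # P(R | C = dose)


def recovery_given_not(dose):
    """P(R | not (C = dose)): mix the remaining doses by their priors."""
    others = [d for d in prior if d != dose]
    weight = sum(prior[d] for d in others)
    return sum(prior[d] * recovery[d] for d in others) / weight


def complement_relevance(dose):
    """|P(R | C = dose) - P(R | not (C = dose))|, the complement-based measure."""
    return abs(recovery[dose] - recovery_given_not(dose))


def pairwise_relevance():
    """Largest change in P(R | C) over all pairs of assignments to C."""
    values = list(recovery.values())
    return max(values) - min(values)


print(round(complement_relevance(500), 3))   # 0.45
print(round(pairwise_relevance(), 3))        # 0.7
```

The pairwise measure never needs the probability of a disjunction such as "0 mg or 250 mg", which is the reason the definition above prefers it.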

The strength of the degree of relevance changes according to the value of $\delta$. A
$\delta$ value of zero results in the well-known strong irrelevance notion of probabilistic
independence. Irrelevance is weaker for higher $\delta$ settings. Weak temporal relevance
corresponding to reasonably small $\delta$ values is of particular interest here.
We redefine irrelevance using extraneousness instead of independence. The new
definition allows us to ignore weak relevance.

Definition. The theory $\Theta$ can be divided into a relevant subset $\Theta_Q$ and
an extraneous subset $\Theta_E$. $\Theta_Q$ answers the query $Q$ with accuracy $\delta$ iff for any
conjunction (possibly singleton) $q \subseteq Q$, $|P(q \mid \Theta) - P(q \mid \Theta_Q)| < \delta$.

Starting with the complete theory $\Theta$, we would like to identify irrelevant
and weakly relevant information. This information can then be discarded and
reasoning can proceed with the concise theory $\Theta_Q$. Here, we concentrate on the
degeneration of the relevance of information as time evolves. Past information
becomes irrelevant after a time duration $T$. This duration depends on the dynamic
nature of the process, the required accuracy $\delta$ and the probability distribution
describing the process.



2 Relevance Time 

Before proceeding any further, it may be useful to examine more closely the 
performance savings that can be achieved by discarding irrelevant information. 
To this end, we start by considering the following example. 

The cat is seen in the living room at 9:00 AM. We are interested in evaluating
the probability that the cat is in the room at 2:00 PM. The cat may leave the
room during any minute with probability $P(leave \mid inside) = 0.00579$ and it
may enter the room during any minute with probability $P(enter \mid outside) = 0.00773$.
Assuming a one-minute resolution in our example, we can expect the
network to have 300 instantiations.^2 The next subsection presents some results
which imply that by ignoring the earlier, weakly relevant observation, substantial
savings in computation can be achieved for a very small error (less than 0.001)
in probabilities.

2.1 Relevance in Time-Sliced Belief Networks 

An analytical proof [?], obtained by solving the recurrence describing the
behaviour of a single-variable system, shows that there exists a duration $T$ such
that the probability of a fluent (a time-varying fact or state) $f$ at a time $t > t_0 + T$
changes by at most $\delta$ depending on the truth of $f$ at $t_0$. Moreover, for any
arbitrarily small $\delta$, a corresponding duration $T$ can be found by substitution in

$$T = \frac{\ln \delta}{\ln |1 - p_1 - p_2|}$$

where $T$ is the smallest time such that $|P(f_t \mid f_0) - P(f_t \mid \neg f_0)| \le \delta$, $p_1$ is the
probability that $f$ changes from true to false, and $p_2$ is the probability of a
change from false to true.

The extensive literature on the convergence of Markov chains to a stationary
distribution provides upper limits on the convergence time because exact
solutions for more than two states require solving more complex recurrences.
Moreover, the rather sudden and fast convergence of some chains after a certain
time, known as the cutoff phenomenon [?], is not well understood. This study
therefore adopts a more heuristic and rather empirical approach to the problem.

In the transition matrix for the two-state case,

$$\begin{pmatrix} 1 - p_1 & p_1 \\ p_2 & 1 - p_2 \end{pmatrix}$$

$|1 - p_1 - p_2|$ represents the absolute difference between the first and second row
values in either column of the matrix. Thus, this quantity could be regarded
as measuring the difference between the probability vectors for the two states.
Intuitively, this seems reasonable; if the probabilities for the next stage are
approximately the same for all states then the series of probabilities for different
starting states will quickly look the same, indicating an early convergence.
Conversely, if the various states result in wildly different probabilities for the next
stage, the initial state will strongly determine the probabilities for some time.

We therefore sought a measure of the difference between probability vectors
which would be meaningful for more than two states, in the hope that using this
measure in the above formula would provide a predictor or at least an upper
bound for relevance times. We wanted this measure to satisfy two constraints:

- It should reduce in the two-state case to the formula above.

- It should range from 0 to 1, to ensure meaningful results. A value of 0 will
  cause a prediction of 0 relevance time and consequently should occur only
  when all probability vectors are identical; a value of 1 will correspond to a
  prediction of infinite relevance time and consequently should occur when a
  system's evolution is deterministic.

^2 Other approaches may generate networks with only two instantiations depending on
the criteria used for adding instantiations.
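
The two-state formula can be evaluated directly for the cat example. The following is a minimal sketch, assuming the closed form $T = \ln\delta / \ln|1 - p_1 - p_2|$ stated above; the function names are illustrative, and the second function simply iterates the two-state chain so that the closed form can be compared against a step-by-step propagation.

```python
import math


def two_state_relevance_time(p1, p2, delta):
    """Closed-form T = ln(delta) / ln|1 - p1 - p2| for a two-state fluent."""
    rate = abs(1.0 - p1 - p2)
    if rate == 0.0:
        return 0.0          # both start states give the same next-step belief
    if rate == 1.0:
        return math.inf     # deterministic evolution: the start state never fades
    return math.log(delta) / math.log(rate)


def two_state_difference(p1, p2, steps):
    """|P(f_t | f_0) - P(f_t | not f_0)| obtained by iterating the chain."""
    from_true, from_false = 1.0, 0.0   # P(f is true) for the two start states
    for _ in range(steps):
        from_true = from_true * (1 - p1) + (1 - from_true) * p2
        from_false = from_false * (1 - p1) + (1 - from_false) * p2
    return abs(from_true - from_false)


# Parameters of the cat example: p1 = P(leave | inside), p2 = P(enter | outside).
T = two_state_relevance_time(0.00579, 0.00773, delta=0.001)
steps = math.ceil(T)
print(steps, two_state_difference(0.00579, 0.00773, steps))
```

Iterating the chain reproduces the geometric decay $|1 - p_1 - p_2|^t$ of the difference, which is what the closed form inverts.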



After trying several simple measures based on the Euclidean distance between
vectors, our most promising results came from a more straightforward measure
which defines the distance $d$ between two row vectors as the maximum absolute
difference between any two elements in the same column of the two vectors.
A matrix is formed such that element $i,j$ in the matrix corresponds to the
conditional probability $P(X = x_i \mid V_j)$, where $X$ is a network variable, $x_i$ is a
possible value for $X$, and $V_j$ denotes a valid configuration for the predecessors
of $X$. The maximum difference between any two row vectors in the network
is the distance we are looking for. Note that this measure is easy to calculate:
the computation time is linear in the number of matrix entries. The expected
convergence time is then given by

$$T = \frac{\ln \delta}{\ln d}$$
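
A minimal sketch of the distance measure and the resulting predictor follows. It assumes, as in the two-state matrix shown earlier, that each row of the table is the probability vector for one conditioning state; the function names are hypothetical. Taking the spread (maximum minus minimum) of each column gives the largest difference between any two rows in a single pass over the entries.

```python
import math


def distance_measure(rows):
    """Largest absolute difference between two rows within the same column.

    `rows` is a list of probability vectors, one per conditioning state
    (an assumption about layout made for this sketch).
    """
    return max(max(column) - min(column) for column in zip(*rows))


def predicted_relevance_time(rows, delta):
    """T = ln(delta) / ln(d), with the two boundary cases handled explicitly."""
    d = distance_measure(rows)
    if d <= 0.0:
        return 0.0          # identical probability vectors
    if d >= 1.0:
        return math.inf     # deterministic evolution
    return math.log(delta) / math.log(d)


# Two-state check: d reduces to |1 - p1 - p2| (here p1 = 0.2, p2 = 0.3).
two_state = [[0.8, 0.2], [0.3, 0.7]]
print(distance_measure(two_state), predicted_relevance_time(two_state, 0.001))
```

The two boundary cases correspond to the two constraints listed for the measure: a distance of 0 predicts zero relevance time and a distance of 1 predicts an infinite one.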



3 Implementation and Algorithm 

To validate the relevance measure proposed above, a series of experiments has 
been conducted. Using some synthetic and published time-sliced belief networks, 
and randomly generated probabilities, we propagated evidence at the first time- 
slice across the network until convergence. The main measured quantity in these 
experiments is the time to convergence. This time is then compared to the con- 
vergence time predicted by our predictor. 
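
The comparison described above can be sketched for the single-variable case. This is an illustration under stated assumptions, not the experimental code of the study: the "actual" relevance time is taken here to be the first time slice at which the beliefs obtained from every possible starting state agree to within $\delta$, random transition tables are drawn by normalising uniform random weights, and all function names are hypothetical.

```python
import math
import random


def step(belief, rows):
    """One time slice: new[j] = sum_i belief[i] * P(next = j | current = i)."""
    n = len(rows)
    return [sum(belief[i] * rows[i][j] for i in range(n)) for j in range(n)]


def actual_relevance_time(rows, delta, max_steps=100_000):
    """First t at which beliefs from all start states differ by less than delta."""
    n = len(rows)
    beliefs = [[1.0 if i == s else 0.0 for i in range(n)] for s in range(n)]
    for t in range(max_steps + 1):
        spread = max(max(col) - min(col) for col in zip(*beliefs))
        if spread < delta:
            return t
        beliefs = [step(b, rows) for b in beliefs]
    return None  # no convergence within max_steps


def predicted_relevance_time(rows, delta):
    """ln(delta) / ln(d), with d the largest within-column spread of the table."""
    d = max(max(col) - min(col) for col in zip(*rows))
    if d <= 0.0:
        return 0.0
    if d >= 1.0:
        return math.inf
    return math.log(delta) / math.log(d)


def random_rows(n, rng):
    """n random probability vectors over n states (normalised uniform weights)."""
    rows = []
    for _ in range(n):
        weights = [rng.random() for _ in range(n)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


rng = random.Random(0)
under = 0
for _ in range(200):
    rows = random_rows(4, rng)
    if predicted_relevance_time(rows, 0.001) < actual_relevance_time(rows, 0.001):
        under += 1
print("under-predictions out of 200:", under)
```

Counting the cases in which the predicted time falls short of the observed one corresponds to the "number underestimated" column reported in the tables of the next section, although the figures produced by this sketch are not expected to match those tables.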

To speed up the experiments and detect the convergence as early as possible,
the propagation utilized a modified version of the roll-up algorithm [?]. The
algorithm performs three operations: prediction, roll-up, and estimation. It
maintains two slices of the network and uses the values of the beliefs
at time $t - 1$ to evaluate the beliefs at time $t$. The calculated beliefs are then
shifted to the original slice to perform another propagation, and so on. This
algorithm, however, has the following limitations:

- Common causes are not handled properly because there is no backward
  propagation phase to update the probability of a cause in an earlier time slice.
  In fact, the presence of common causes makes all the time slices correlated,
  and more elaborate techniques would be needed [?].

- The propagation using two slices does not work correctly if temporal arcs
  connect dependent nodes. Some beliefs may appear independent when only
  two time slices are used even though they are dependent in the full network.

For this particular study, evidence is only available at the first time slice, which
means that the first of these two problems does not affect the propagation. To
address the second problem, we modified the roll-up algorithm such that each
node in the first time slice keeps all incident edges.

For each node in the first time slice, a probability vector is calculated according
to the rule of elimination^3 (by asserting every possible combination of states
of that node's parents and finding the probability-weighted sum). This allowed us
to adjust the probability tables of all the nodes in the first time slice to reflect
their dependencies.
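
The probability-weighted sum over parent configurations can be sketched as follows. This is an illustration only: it assumes the parents are weighted by the product of their individual marginals, which the text does not state, and the table layout and the name `marginal` are hypothetical.

```python
from itertools import product


def marginal(cpt, parent_marginals):
    """Marginal distribution of a node by the total probability theorem.

    cpt maps each tuple of parent states to the node's distribution given
    that configuration; parent_marginals[k][s] is P(parent k = s).
    Assumption for this sketch: a configuration's weight is the product of
    the parents' marginals.
    """
    n_states = len(next(iter(cpt.values())))
    result = [0.0] * n_states
    for config in product(*(range(len(m)) for m in parent_marginals)):
        weight = 1.0
        for parent, state in enumerate(config):
            weight *= parent_marginals[parent][state]
        for s in range(n_states):
            result[s] += weight * cpt[config][s]
    return result


# One binary parent with P(parent) = (0.5, 0.5).
cpt = {(0,): [0.2, 0.8], (1,): [0.6, 0.4]}
print(marginal(cpt, [[0.5, 0.5]]))
```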



4 Results of the Empirical Study 

The first set of experiments considered belief networks with a single node but 
instead of assuming binary variables, we allowed the variables to take up to ten 
distinct values. 

In the following discussion, the proposed predictor is considered to have un- 
derestimated the relevance time if the predicted time is shorter than the actual 
time. The predictor overestimates if the relevance time predicted is longer than 
the observed time. 

Results for the three-valued case, shown in Figure 1, were good: more than five
thousand random tests failed to produce any cases where the formula predicted a
cutoff time earlier than observed. Thus, while no rigorous proof could be found,
it seems reasonable to suggest that the formula may provide an actual upper
bound for this case.

Unfortunately, when more values are allowed, the formula under-predicted in 
some cases. However, under-predictions occurred for well under 2% of all matrices 
tested. Thus, the formula might still be useful for some practical applications 
where absolute certainty is not required, particularly if a ‘slush factor’ is added 
to the formula’s results (almost all under-predictions were off by 5 or less). Figure 
2 displays the results obtained when five values are allowed. 

Table 1 summarizes the results obtained from two thousand experiments 
using randomly generated probabilities for each of the 3, 4, 5, and 10-valued 
single variable belief networks. Table 2 reports the worst observed results. 



Values   Number           Average Absolute   Average Relative   Average Absolute
         Underestimated   Overestimation     Overestimation     Underestimation
3        0                5.87               70.32%             -
4        26               6.216              75.63%             -2.077
5        46               5.102              65.44%             -1.565
10       26               2.262              35.03%             -1.038

Table 1. Average performance of the proposed predictor for multiple-valued
single node networks with $\delta = 0.001$



Fig. 1. Three-valued network: Actual and predicted relevance time versus distance measure

Values   Average Relative   Worst Absolute   Worst Relative   Worst Absolute    Worst Relative
         Underestimation    Overestimation   Overestimation   Underestimation   Underestimation
3        -                  157              2433.33%         -                 -
4        -14.58%            167              1720.00%         -9                -30.00%
5        -12.62%            54               540.00%          -9                -23.68%
10       -13.03%            22               314.29%          -2                -18.18%

Table 2. Worst case performance of the proposed predictor for multiple-valued
single node networks with $\delta = 0.001$

Fig. 2. Five-valued network: Actual and predicted relevance time versus distance measure

Virtually all practical applications of course use more complex multi-node
networks. To study the behavior of these networks we used some synthetic networks
and some networks from the literature. For these networks we use the
largest $d$ measure in all the nodes as the distance used in estimating the
convergence time.

^3 Also known as the total probability theorem, $P(A) = \sum_i P(B_i) P(A \mid B_i)$.

To evaluate the relevance time in a multi-node network, all combinations of
states for observable variables are used as possible starting observations. The
relevance time is the time required to be within $\delta$ of the stationary distribution
(i.e., initial conditions become irrelevant). The experiments were conducted using
randomly generated synthetic networks, with evidence set at a different node
each time. The results in Tables 3 and 4 correspond to a $\delta$ value of 0.001. The
results for the following three networks are given: a simple two-variable network,
where each variable depends on the two variables in the previous time step;
the drowning network [10]; and the car start network [?]. For each network,
two thousand experiments were conducted with consistent random probability
distributions.




Fig. 3. Car start network: Actual and predicted relevance time versus distance
measure



Network     Number           Average Absolute   Average Relative   Average Absolute
            Underestimated   Overestimation     Overestimation     Underestimation
Simple      12               39.864             408.34%            -4.083
Drown       0                60.047             904.25%            -
Car start   0                68.700             991.86%            -

Table 3. Average performance of the proposed predictor for multiple node
networks with $\delta = 0.001$



The fact that the formula never underestimated for both ‘realistic’ networks
seems promising at first glance. However, the average overestimation values are
rather high. The proposed predictor seems to provide a good upper bound, but
a tighter bound is more desirable.

Network     Average Relative   Worst Absolute   Worst Relative   Worst Absolute    Worst Relative
            Underestimation    Overestimation   Overestimation   Underestimation   Underestimation
Simple      -15.21%            20546            205460.00%       -18               -36.73%
Drown       -                  6459             107650.00%       -                 -
Car start   -                  4036             134533.33%       -                 -

Table 4. Worst case performance of the proposed predictor for multiple node
networks with $\delta = 0.001$



Overestimated by   Drowning   Car Starting
0                  82         0
1                  3          0
2                  12         0
3                  12         1
4                  13         1
5                  13         6
6 to 10            90         77
11 to 15           158        147
16 to 20           186        182
21 to 25           196        187
26 to 50           613        604
51 to 75           260        312
76 to 100          117        146
101 to 125         66         101
126 to 150         51         73
151 to 175         30         39
176 to 200         14         19
201 to 400         64         74
401 to 600         10         20
601 to 800         3          7
801 to 1000        0          1
1001 to 2000       5          2
2001 to 3000       1          0
3001 to 4000       0          0
4001 to 5000       0          1
5001 or more       1          0

Table 5. Error histogram



The histogram of the error values in Table 5 shows that for the car starting
network the predictor overestimated by several hundred in many cases. The
numbers for the drowning network are more promising, however; the fact that
the formula exactly predicts relevance time reasonably often without ever
under-predicting suggests that the formula provides a useful upper bound for this
network.

The different behavior of the two networks suggests that the architecture of 
the network plays an important role in the rate of degeneration of information 
relevance. Further experimentation is needed to determine how the structure of 
the network affects the rate of convergence. 

5 Conclusions and Future Research 



In this paper we examined the degeneration of relevance in several time-sliced
belief networks. Overall, we have found a fast rate of convergence to a stationary
distribution. Here we summarize the key results and identify their limitations and
implications.

A simple convergence predictor, based on measuring how dissimilar the
probabilities are, provides a useful upper bound for relevance time. The study of this
upper bound indicates that further refinements are needed. The effect of network
structure on the relevance time is still unclear. A tighter bound would be more
useful in applications. The upper bound presented here has performed fairly well:
it has rarely been exceeded, and then only by a small margin.

The actual relevance time has rarely exceeded 100 time slices, which is
surprising given that we have conducted more than 12,000 experiments using
randomly generated probabilities. This observation is particularly promising
because it implies that many applications can benefit from discarding weakly
relevant information.


Abstract. Naive Bayes is an efficient and effective learning algorithm,
but previous results show that its representation ability is severely lim- 
ited since it can only represent certain linearly separable functions in the 
binary domain. We give necessary and sufficient conditions on linearly 
separable functions in the binary domain to be learnable by Naive Bayes 
under uniform representation. We then show that the learnability (and 
error rates) of Naive Bayes can be affected dramatically by sampling dis- 
tributions. Our results help us to gain a much deeper understanding of 
this seemingly simple, yet powerful learning algorithm. 



1 Introduction 

Naive Bayes is one of the most efficient and effective inductive learning
algorithms for machine learning and data mining. It is time efficient: its time
complexity is linear in the size of the training data. It is space efficient since, after
discretization, it builds a frequency table whose size is the product of the number
of attributes, the number of class values, and the number of values per attribute. It
does not need to store the training data in memory when it builds the frequency
table, but just scans the data once from the disk.

Naive Bayes is also surprisingly effective in the classification task. Many 
empirical comparisons between Naive Bayes and modern decision tree algorithms 
such as C4.5 (Quinlan, 1993) showed that Naive Bayes predicts equally well as 
C4.5 (see, for example, Langley, Iba & Thomas, 1992; Kononenko, 1990; Pazzani, 
Muramatsu & Billsus, 1996). The good performance of Naive Bayes is surprising 
because it makes an unrealistic assumption that is almost always violated in 
real-world applications: given class values, all attributes are independent. 

Domingos and Pazzani (1997) presented an important work explaining the 
good predictive performance of Naive Bayes. Basically, even though the assump- 
tion that Naive Bayes makes alters the probability distribution of the class label, 
the class with the maximum probability can still be the same. That is, under 
the MAP (Maximum A Posteriori) Principle, the 0/1 loss of Naive Bayes can be 
very small (Domingos & Pazzani, 1997). 

However, Naive Bayes has a strong limitation in its representation ability,
since it is well known that Naive Bayes can represent only linearly separable
functions in the binary domain (Duda & Hart, 1973). Further, Naive Bayes cannot
represent all linearly separable functions. As Domingos and Pazzani (1997)
showed, several specific linear functions are not learnable with Naive Bayes under
uniform distribution.

We extend the work of Domingos and Pazzani (1997) further in this paper. We
derive necessary and sufficient conditions for linearly separable functions to be 
learnable by Naive Bayes under uniform distribution, and we verify our results 
with empirical experiments. We then show, surprisingly, that the learnability 
and error rates of Naive Bayes on linearly separable functions can be affected 
dramatically by the sampling distribution: learnable functions under uniform 
distribution can become unlearnable under other distributions, and vice versa. 

Our results help us to gain a deeper understanding of this seemingly simple 
yet powerful learning algorithm. Naive Bayes’ learning capacities are determined 
not only by the target function, but also by the sampling distribution, and by 
how attribute values are represented. 

2 Naive Bayes Basics 

We start with the Boolean classification problem in the binary domain. Assume 
X\, A 2 ,..., Xn are n binary Boolean variables taking values 1 or 0. For conve- 
nience, we use Xi to represent the value that Xi takes. Thus, an example E can 
be represented by a vector {x\,X 2 , , ...,a;„). Let <F, called the example space, be 
a set of all possible (2”) examples. We use C to represent the target (classifica- 
tion) variable, which also takes values 1 or 0 (again, we use use c to represent 
the value C takes). 

According to the Bayesian Theorem, the probability of an example E = 
{x\,X 2 , , ..., Xn) being class c is 

/ p{xi,X 2 ,.:,Xn\c)p{c) 

P{c\E) = r 

p[Xi,X2, -.jXn) 

The prediction which Bayes makes is taken as the class value with the largest
probability. That is, E belongs to the class C = 1 iff

$$g(E) = \frac{p(C = 1 \mid x_1, x_2, \ldots, x_n)}{p(C = 0 \mid x_1, x_2, \ldots, x_n)} \ge 1,$$

where $g(E)$ is called the discriminant function.

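Naive Bayes assumes variable independence given class values. That is:

$$p(x_1, x_2, \ldots, x_n \mid c) = p(x_1 \mid c)\, p(x_2 \mid c) \cdots p(x_n \mid c)$$

The corresponding discriminant function $g(E)$ of Naive Bayes is then:

$$g(E) = \frac{p(C = 1)\, p(x_1 \mid C = 1)\, p(x_2 \mid C = 1) \cdots p(x_n \mid C = 1)}{p(C = 0)\, p(x_1 \mid C = 0)\, p(x_2 \mid C = 0) \cdots p(x_n \mid C = 0)} \qquad (1)$$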

The value of $p(x_i \mid c)$ can be estimated from the training examples.

According to Duda and Hart (1973), Naive Bayes can only represent linearly
separable (binary) functions. A linearly separable function is defined as
$E \in C = 1$ iff $g_l(E) \ge 0$, where

$$g_l(E) = \sum_{i=1}^{n} a_i x_i - a_0 \qquad (2)$$

Assume $p_i$ and $q_i$ represent $p(X_i = 1 \mid C = 1)$ and $p(X_i = 1 \mid C = 0)$
respectively. When the logarithm is applied to Equation (1) we obtain a linear
discriminant function, denoted by $g_b$, as below (Duda & Hart, 1973):

$$g_b(E) = \sum_{i=1}^{n} a_i x_i - a_0 \qquad (3)$$

where

$$a_i = \log \frac{p_i (1 - q_i)}{(1 - p_i)\, q_i} \qquad (4)$$

$$a_0 = -\left( \sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i} + \log \frac{p(C = 1)}{p(C = 0)} \right) \qquad (5)$$



Clearly, $E \in C = 1$ iff $g(E) \ge 1$ iff $g_b(E) \ge 0$. Thus, Naive Bayes can only
represent linearly separable functions.
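The step from the ratio form to the linear form can be checked numerically. The following is a minimal sketch, assuming three Boolean attributes with made-up probabilities; the names `g_ratio`, `g_linear`, `P`, `Q`, and `PRIOR1` are illustrative and do not come from the paper. It compares the logarithm of the Naive Bayes ratio with the linear discriminant on every example.

```python
from itertools import product
from math import log, prod

def g_ratio(x, p, q, prior1):
    """Ratio of the class-1 score to the class-0 score under independence."""
    num = prior1 * prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
    den = (1 - prior1) * prod(qi if xi else 1 - qi for xi, qi in zip(x, q))
    return num / den

def g_linear(x, p, q, prior1):
    """Linear discriminant sum(a_i * x_i) - a_0 obtained by taking logarithms."""
    a = [log(pi * (1 - qi) / ((1 - pi) * qi)) for pi, qi in zip(p, q)]
    a0 = -(sum(log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
           + log(prior1 / (1 - prior1)))
    return sum(ai * xi for ai, xi in zip(a, x)) - a0

# Hypothetical parameters, chosen only for illustration.
P = [0.8, 0.6, 0.3]   # p_i = p(X_i = 1 | C = 1)
Q = [0.2, 0.5, 0.7]   # q_i = p(X_i = 1 | C = 0)
PRIOR1 = 0.4          # p(C = 1)

# Largest difference between log g(E) and the linear form over all 2^3 examples.
max_gap = max(abs(log(g_ratio(x, P, Q, PRIOR1)) - g_linear(x, P, Q, PRIOR1))
              for x in product([0, 1], repeat=3))
```

If the derivation holds, `max_gap` is zero up to floating-point rounding, so the two forms assign every example to the same class.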

However, Naive Bayes cannot represent all linearly separable functions. A
special case is that Naive Bayes cannot learn some m-of-n concepts (Domingos
& Pazzani, 1997). An m-of-n concept is a Boolean concept that is true if m
or more out of n Boolean variables are true. Clearly, it is a linearly separable
function. Domingos & Pazzani (1997) showed that for the concept 8-of-25, when
the input Boolean variables have just six or seven 1's, Naive Bayes gives an
incorrect answer of 1 (instead of 0). Their result is based on the assumption
that the example space consists of all $2^n$ examples of the 8-of-25 concept, that
is, that the sampling distribution is uniform.



3 Naive Bayes Under Uniform Distribution 

In this section, we assume that the sampling distribution is uniform. We start 
with a few necessary definitions: 

Definition 1: Two discriminant functions $f_1$ and $f_2$ are said to be equal under
zero-one loss if for every example E in the example space, $f_1(E) \ge 0$ if and only if
$f_2(E) \ge 0$. We denote this $f_1 = f_2$.

Definition 2: A linearly separable function defined by the linear function $g_l$ is
called Naive-Bayes linearly separable (NBLS), if for every example E in the example
space, $g_l(E) = g_b(E)$ under zero-one loss. In short, we say this concept is NBLS.

Let us consider a special class of linear functions $g_l$:

$$g_l(E) = \sum_{i=1}^{n} a\, x_i - a_0 \qquad (6)$$

That is, $a_i = a$ for all $1 \le i \le n$. Let $x$ denote $\sum_{i=1}^{n} x_i$, which is the number of
variables with $x_i = 1$ in the example. We have $g_l(E) = g_l(x) = ax - a_0$. We consider
$a, a_0 > 0$ without loss of generality.

For this kind of linearly separable function, it is obvious that $p_1 = p_2 =
\cdots = p_n$ and $q_1 = q_2 = \cdots = q_n$ under uniform distribution, where $p_i$ and $q_i$
are defined earlier. Assume $p = p_i$ and $q = q_i$. It can be shown easily that
$p > 0.5$ and $q < 0.5$.^1 Then for Naive Bayes, the discriminant function can be
represented as:

$$g_b(E) = g_b(x) = a'x - a_0' \qquad (7)$$

where

$$a' = \log \frac{p(1 - q)}{(1 - p)\, q} \qquad (8)$$

$$a_0' = n \log \frac{1 - q}{1 - p} + \log \frac{p(C = 0)}{p(C = 1)} \qquad (9)$$

Clearly, we have a' > 0. We have the following theorem: 

Theorem 1: For binary functions defined by Equation (6), $g_b(x) = g_l(x)$ if and only
if $x \ge \max(\frac{a_0}{a}, \frac{a_0'}{a'})$ or $x < \min(\frac{a_0}{a}, \frac{a_0'}{a'})$, where $0 \le x \le n$, and $a$, $a_0$, $a'$, and $a_0'$ are
given by Equations (6), (8), and (9).

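Proof:

Suppose $x \ge \max(\frac{a_0}{a}, \frac{a_0'}{a'})$ or $x < \min(\frac{a_0}{a}, \frac{a_0'}{a'})$. Then whenever $g_l(x) \ge 0$,
$g_b(x) \ge 0$, and whenever $g_l(x) < 0$, $g_b(x) < 0$ too. That is, under zero-one loss,
$g_l(x)$ is equal to $g_b(x)$.

Conversely, suppose $g_b(x) = g_l(x)$. Then $g_l(x) \ge 0$ whenever $g_b(x) \ge 0$, and
$g_l(x) < 0$ whenever $g_b(x) < 0$. That is,
$x \ge \frac{a_0}{a}$ and $x \ge \frac{a_0'}{a'}$, or $x < \frac{a_0}{a}$ and $x < \frac{a_0'}{a'}$.

So, we have $x \ge \max(\frac{a_0}{a}, \frac{a_0'}{a'})$ or $x < \min(\frac{a_0}{a}, \frac{a_0'}{a'})$.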

Essentially, Theorem 1 tells us that for any Boolean function defined by the
linear Equation (6), if there exists some integer j in the following left-closed and
right-open interval:

$$\left[ \min\left(\frac{a_0}{a}, \frac{a_0'}{a'}\right),\ \max\left(\frac{a_0}{a}, \frac{a_0'}{a'}\right) \right)$$

then for any example with $x = \sum x_i = j$, Naive Bayes will make an incorrect
prediction; for all other examples, Naive Bayes agrees with the target function
values.
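The interval condition is easy to evaluate directly. Below is a small sketch, assuming the two thresholds $a_0/a$ and $a_0'/a'$ are already known; the function names and the numeric values in the usage note are hypothetical and serve only to illustrate the statement of Theorem 1.

```python
def misclassified_counts(a, a0, a_nb, a0_nb, n):
    """Integers j with 0 <= j <= n that fall in the left-closed, right-open
    interval [min(a0/a, a0'/a'), max(a0/a, a0'/a')).

    a, a0       -- coefficients of the target linear function g_l
    a_nb, a0_nb -- coefficients a', a0' of the Naive Bayes discriminant g_b
    """
    lo = min(a0 / a, a0_nb / a_nb)
    hi = max(a0 / a, a0_nb / a_nb)
    return [j for j in range(n + 1) if lo <= j < hi]

def agrees(a, a0, a_nb, a0_nb, x):
    """True when g_l and g_b assign the same class to an example with x ones."""
    return (a * x - a0 >= 0) == (a_nb * x - a0_nb >= 0)
```

For instance, with the illustrative values `a=1, a0=8, a_nb=1.0, a0_nb=5.9, n=25`, the interval is [5.9, 8), so `misclassified_counts` returns the counts 6 and 7, and `agrees` is false exactly for those two counts.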

The Boolean functions defined by Equation (6) are essentially the m-of-n concepts,
where $a = 1$ and $a_0 = m$. The discriminant function of the m-of-n concept,
denoted by $g_{m\text{-}n}$, is thus:

$$g_{m\text{-}n}(E) = g_{m\text{-}n}(x) = x - m$$



^1 Under uniform distribution, $p(X_i = 1) = 0.5$. Since some $X_i$ must be 1 for C = 1,
$p = p(X_i = 1 \mid C = 1) > 0.5$, and $q = p(X_i = 1 \mid C = 0) < 0.5$.

Applying Theorem 1 to the m-of-n concept, we have the following corollary.

Corollary 1: Naive Bayes discriminant function $g_b(x) = g_{m\text{-}n}(x)$ on an arbitrary
m-of-n concept if and only if $x \ge \max(m, \frac{a_0'}{a'})$ or $x < \min(m, \frac{a_0'}{a'})$, where
$0 \le x \le n$, and $a$, $a_0$, $a'$, and $a_0'$ are given by Equations (6), (8), and (9). The corresponding
probabilities can now be obtained explicitly (Domingos & Pazzani, 1997):



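$$p(C = 1) = \frac{1}{2^n} \sum_{i=m}^{n} \binom{n}{i}, \qquad p(C = 0) = \frac{1}{2^n} \sum_{i=0}^{m-1} \binom{n}{i}$$

$$p = p(x_i = 1 \mid C = 1) = \frac{\sum_{i=m}^{n} \binom{n-1}{i-1}}{\sum_{i=m}^{n} \binom{n}{i}}, \qquad q = p(x_i = 1 \mid C = 0) = \frac{\sum_{i=1}^{m-1} \binom{n-1}{i-1}}{\sum_{i=0}^{m-1} \binom{n}{i}}$$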




Corollary 1 gives a necessary and sufficient condition for Naive Bayes to
classify the examples of an m-of-n concept in agreement with the target function. The next
theorem tells us exactly which m-of-n concepts are NBLS; that is,
$g_b(x) = g_{m\text{-}n}(x)$ for all x.

Theorem 2: An m-of-n concept is NBLS if and only if its corresponding Naive
Bayes discriminant function (Equation (7)) satisfies the following condition:

$$\frac{a_0'}{a'} \in (m - 1,\ m]$$



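Proof: Let $N = \{1, 2, \ldots, n\}$, and let $\Delta$ represent a condition. $S_N(\Delta)$ denotes the set
$\{y \mid y \in N$ and y satisfies $\Delta\}$. Suppose $\frac{a_0'}{a'} \in (m - 1, m]$.

Then $\max(m, \frac{a_0'}{a'}) = m$ and $\min(m, \frac{a_0'}{a'}) = \frac{a_0'}{a'}$. So

$$S_N\left(x \ge \max\left(m, \frac{a_0'}{a'}\right)\right) = S_N(x \ge m) = \{m, m + 1, \ldots, n\}$$

$$S_N\left(x < \min\left(m, \frac{a_0'}{a'}\right)\right) = S_N(x < m) = \{m - 1, m - 2, \ldots, 1\}$$

$$S_N\left(x \ge \max\left(m, \frac{a_0'}{a'}\right)\right) \cup S_N\left(x < \min\left(m, \frac{a_0'}{a'}\right)\right) = N.$$

According to Corollary 1, for any $x \in N$, $g_b(x) = g_{m\text{-}n}(x)$. That is, the m-of-n
concept is NBLS.

Conversely, suppose an m-of-n concept is NBLS. Then for any $x \in N$, $g_b(x) = g_{m\text{-}n}(x)$.
There are three cases, as below:

(1) $|m - \frac{a_0'}{a'}| > 1$
(2) $|m - \frac{a_0'}{a'}| = 1$
(3) $|m - \frac{a_0'}{a'}| < 1$

In the first case, there is at least one integer $j \in N$ such that $m \le j < \frac{a_0'}{a'}$ or
$\frac{a_0'}{a'} \le j < m$. From Corollary 1, $g_b(j) \ne g_{m\text{-}n}(j)$.

In the second case, $\frac{a_0'}{a'} = m - 1$ or $\frac{a_0'}{a'} = m + 1$.
For the former, we have $g_b(m - 1) \ne g_{m\text{-}n}(m - 1)$.
For the latter, we have $g_b(m) \ne g_{m\text{-}n}(m)$.

In the third case, $\frac{a_0'}{a'} \in (m - 1, m]$ or $\frac{a_0'}{a'} \in (m, m + 1)$.
If $\frac{a_0'}{a'} \in (m, m + 1)$, then $\max(m, \frac{a_0'}{a'}) = \frac{a_0'}{a'}$ and $\min(m, \frac{a_0'}{a'}) = m$.
It is easy to show $g_b(m) \ne g_{m\text{-}n}(m)$. Therefore $\frac{a_0'}{a'} \in (m - 1, m]$.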

Since $\frac{a_0'}{a'}$ is a function of only m and n, Theorem 2 gives us a necessary
and sufficient condition on m and n for any m-of-n concept to be representable
by Naive Bayes under uniform distribution. Let $L_n$ denote the set of m such
that the corresponding m-of-n concepts are NBLS. Here are several examples:
$L_{24} = \{12, 13\}$, $L_{25} = \{13\}$, $L_{60} = \{30, 31\}$, $L_{61} = \{31\}$, $L_{134} = \{67, 68\}$, and
$L_{135} = \{68\}$. From the above results, we have the following:

Conjecture: Any m-of-n concept is NBLS if and only if $m = \frac{n}{2}$ or $\frac{n}{2} + 1$ when
n is even, and $m = \frac{n + 1}{2}$ when n is odd.

We have verified this conjecture to be true for 6 < n < 1322, but we have not
been able to formally prove it. Nevertheless, it is clear that only a small portion
of the m-of-n concepts is NBLS.
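The condition of Theorem 2 can be evaluated numerically for small n. The sketch below is one possible way to do so, not the authors' verification program. It assumes that the probabilities are obtained by counting examples under the uniform distribution and that $a_0' = n \log\frac{1-q}{1-p} + \log\frac{p(C=0)}{p(C=1)}$; the function names are hypothetical. The cases m = 1 and m = n are skipped because they give q = 0 or p = 1, for which the logarithms are undefined.

```python
from math import comb, log

def nb_threshold(m, n):
    """Return a0'/a' for the m-of-n concept under the uniform distribution."""
    pos = sum(comb(n, i) for i in range(m, n + 1))   # examples with C = 1
    neg = 2 ** n - pos                               # examples with C = 0
    # Examples of each class in which one fixed attribute equals 1.
    pos_one = sum(comb(n - 1, i - 1) for i in range(m, n + 1))
    neg_one = sum(comb(n - 1, i - 1) for i in range(1, m))
    p = pos_one / pos                                # p(x_i = 1 | C = 1)
    q = neg_one / neg                                # p(x_i = 1 | C = 0)
    a_nb = log(p * (1 - q) / ((1 - p) * q))
    a0_nb = n * log((1 - q) / (1 - p)) + log(neg / pos)
    return a0_nb / a_nb

def is_nbls(m, n):
    """Theorem 2: NBLS iff a0'/a' lies in the interval (m - 1, m]."""
    return m - 1 < nb_threshold(m, n) <= m

def nbls_set(n):
    """All m with 1 < m < n whose m-of-n concept satisfies Theorem 2."""
    return [m for m in range(2, n) if is_nbls(m, n)]
```

Under these assumptions, `nbls_set(24)` and `nbls_set(25)` can be compared with the sets $L_{24}$ and $L_{25}$ listed above. For 13-of-25 the threshold is 12.5 by symmetry, which lies in (12, 13].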



4 Sampling Distribution 

The above discussions on the representation ability of Naive Bayes are based 
on an assumption that the sampling distribution is uniform. One might tend 
to think that such learnability results can be naturally extended to other sam- 
pling distributions. After all, the representation power of most other learning 
algorithms, such as decision tree and neural network learning algorithms, is not 
affected by the sampling distribution of the training examples. Perceptrons, for
example, can always represent any linearly separable function no matter how
the training (and testing) sets are sampled.

Surprisingly, this is not true for Naive Bayes. 

When the sampling distribution is changed, the probabilities $p_i$, $q_i$ and
$p(C = 1)/p(C = 0)$ in Equations (4) and (5) can be changed, causing the discriminant function
of Naive Bayes to be different from the one under uniform distribution. This may
alter the learnability (and the error rate) of Naive Bayes.

A simple way to see the effect of various sampling distributions on Naive
Bayes is to use what we call “two-tiered” sampling: uniform sampling is
still applied on positive and negative examples respectively, but with various
ratios (under uniform distribution, the ratio is fixed and is determined by the
target function). For example, positive examples may be sampled 10 (or any
number of) times more often than negative examples. That way, all of the
$p_i$ and $q_i$ are kept the same as the ones under uniform distribution, but the
term $p(C = 1)/p(C = 0)$ can be changed to an arbitrary value in $(0, \infty)$.
This allows us to “manipulate” the constant term $a_0$ in the linear discriminant
function (Equation (3)) alone to study the learnability of Naive Bayes.
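Two-tiered sampling can be simulated exactly for m-of-n concepts, since only the prior ratio changes. The following sketch is an illustration under stated assumptions rather than the experimental code behind the figures: it assumes 1 < m < n, keeps p and q at their uniform-distribution values, sets the prior to a chosen `p_pos`, and measures error as the fraction of all $2^n$ examples that are misclassified. The function name `error_rate` is hypothetical.

```python
from math import comb, log

def error_rate(m, n, p_pos):
    """Fraction of the 2**n examples that Naive Bayes misclassifies on the
    m-of-n concept when the training prior is p(C = 1) = p_pos."""
    pos = sum(comb(n, i) for i in range(m, n + 1))
    neg = 2 ** n - pos
    # p and q are unchanged: sampling stays uniform within each class.
    p = sum(comb(n - 1, i - 1) for i in range(m, n + 1)) / pos
    q = sum(comb(n - 1, i - 1) for i in range(1, m)) / neg
    a_nb = log(p * (1 - q) / ((1 - p) * q))
    # Only the constant term depends on the sampling ratio.
    a0_nb = n * log((1 - q) / (1 - p)) + log((1 - p_pos) / p_pos)
    # Count examples (grouped by their number of ones x) where the
    # Naive Bayes prediction differs from the target x >= m.
    wrong = sum(comb(n, x) for x in range(n + 1)
                if (a_nb * x - a0_nb >= 0) != (x >= m))
    return wrong / 2 ** n
```

With this sketch, `error_rate(13, 25, 0.5)` is zero, while `error_rate(13, 25, 0.3)` is not, and `error_rate(8, 25, 0.9)` is zero even though 8-of-25 is misclassified at its uniform prior. This mirrors the qualitative behaviour described in the text.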

Figure 1 shows the error rate under uniform distribution on three m-of-25
concepts when $p(C = 1)$ ranges from 0.01 to 0.99 (and $p(C = 0) = 1 - p(C = 1)$).
The small dark dot on each curve represents the result of that concept under
uniform distribution.




Fig. 1. Error ratio versus $p(C)$ for m-of-n concepts.
Curves: 8-25 (m=8, n=25); 18-25 (m=18, n=25); 13-25 (m=13, n=25).
•: the point on uniform sampling.



Several interesting observations are obtained. First of all, as we discussed earlier,
the 13-of-25 concept is NBLS under uniform distribution, and we can verify this
from the figure: the dot representing uniform distribution on its curve has an error
rate of 0. But 8-of-25 and 18-of-25 are not NBLS under uniform distribution,
and their dots indicate an error rate of about 8%. Second, NBLS functions under
uniform distribution can become non-NBLS under other distributions. For example,
13-of-25, which is NBLS under uniform distribution, becomes non-NBLS
when $p(C = 1)$ is greater than 0.59 or less than 0.42. In fact, the error rate of
13-of-25 can become quite large (as high as 50%; not shown in the figure) under
extreme distributions when $p(C = 1)$ approaches 0 or 1. Third, non-NBLS
functions under uniform distribution (such as 8-of-25 and 18-of-25) can become
NBLS if the sampling distribution changes. As we can see from their curves,
not far from their respective sampling rate of positive examples $p(C = 1)$ under
uniform distribution, the curves reach 0% error rate under other sampling rates.

Figure 2 is similar to Figure 1; we can see clearly again that some linear
functions (such as $m, n, a_0 = 19, 25, 0$) that are NBLS under uniform distribution
are non-NBLS under other distributions, and vice versa.




Fig. 2. Error ratio versus $p(C)$ for example 1.
Curves: 19-25-0 (m=19, n=25, $a_0$=0); 15-25-0 (m=15, n=25, $a_0$=0); 19-25-5 (m=19, n=25, $a_0$=5).
•: the point on uniform sampling.



The error curve for the linearly separable function $\sum_i i\, x_i - a_0 \ge 0$ (Example 2
in Section 0) is plotted in Figure 3. It is very interesting to observe that under
uniform distribution, the functions are not NBLS. When $p(C = 1)$ changes, the
error rates may range from very small (1.8%) to very large (75.1%), but never reach 0%.
Of course, this does not mean that this class of functions is not NBLS under any
distribution, since changing $p(C = 1)$ is just one way of changing the sampling
distribution. It is unknown to us if these functions can become NBLS under
some other distributions.




Fig. 3. Error ratio versus $p(C = 1)$ for $\sum_i i\, x_i - a_0 \ge 0$.
Curves: $a_0$=20; $a_0$=30; $a_0$=40.
•: the point on uniform sampling.



Nevertheless, Naive Bayes cannot learn any non-linearly separable Boolean
functions in the binary domain under any non-zero sampling distribution.^2 This
is simply because no matter how we alter the sampling distribution, the discriminant
function of Naive Bayes is always a linear function.

5 Conclusions 

Naive Bayes is an efficient and effective learning algorithm. But previous results
show that its representation ability is severely limited, since it cannot represent
any non-linearly separable functions in the binary domain. Further, it cannot
represent all linearly separable functions, as shown by several counterexamples given
in (Domingos & Pazzani, 1997). In this paper we extend the results of Domingos & Pazzani
(1997), and give necessary and sufficient conditions for linearly separable
functions in the binary domain to be learnable by Naive Bayes under uniform
distribution. We demonstrate further that the learnability and error rates of
Naive Bayes can be affected dramatically by the sampling distribution of the
training examples. NBLS functions can become non-NBLS, and vice versa.

^2 Non-zero distributions guarantee a non-zero sampling probability on each example
in the example space.

Our results help us to gain a much deeper understanding of this seemingly 
simple but powerful learning algorithm. Naive Bayes’ learning capacities and 
error rates are not only determined by the target function, but also by the 
sampling distribution, and how the attributes are represented.

Abstract. The use of statistical techniques has revolutionized the field
of computational linguistics over the last ten years. In this talk we review
the state of the art in programs that can assign semantic annotations 
to general text. These programs typically start by producing a parse 
tree for the sentences of the text. There are now several programs that 
can produce reasonably accurate complete parses for naturally occurring 
English (e.g., all of the sentences on the front page of today’s New York 
Times). 

We then look at programs that try to extract more of the meaning from 
the sentence. In particular we consider two aspects of a good meaning 
representation: predicate-argument structure (who did what to whom) 
and noun-phrase coreference. While the syntactic structures produced 
by the aforementioned parsers go a long way to establishing predicate- 
argument structure, they fail in several ways, and we look at programs
that fix two of these problems. The first of these programs adds so-called 
“grammatical/function tags” to the parse, and the second adds trace 
(empty) elements to the parse. 

Finally we discuss the current state of the art in statistical noun-phrase 
coreference programs. While we now have reasonably accurate programs 
for pronoun-coreference (e.g., the relation between “George W. Bush Jr.” 
and the pronoun “he”), full-noun-phrase coreference (e.g., the relation 
between “George W. Bush Jr.” and “Governor Bush”) seems to be much 
harder — at least our accuracy is currently much lower. 

We have been able to put all of these programs together to automatically 
syntactically and semantically annotate large quantities of text.

Abstract. After two decades of research on automated discovery (1995;
1998; 1999; Chaudhuri & Madigan, 1999; Edwards, 1993; Komorowski 
& Zytkow, 1998; Langley, Simon, Bradshaw, & Zytkow, 1987; Piatetsky- 
Shapiro & Frawley, 1991; Shen, 1993; Shrager & Langley; Simon, Valdes- 
Perez & Sleeman, 1997; Zytkow, 1992, 1993, 1997), it is worthwhile to 
summarize the foundations for discovery systems by a set of principles. 
We propose a number of such principles and we discuss the ways in 
which different principles can be used together to explain discovery sys- 
tems and guide their construction. Automated discovery is closely linked 
to natural sciences, logic, philosophy of science and theory of knowledge, 
artificial intelligence, statistics, and machine learning. Knowledge discov- 
ery tools use a creative combination of knowledge and techniques from 
the contributing areas, adding its own extra value, which we emphasize 
in several principles. 



1 What Is a Discovery 

We start by clarifying the notion of discovery that applies to automated agents. 
A person who is first to propose and justify a new piece of knowledge K is con- 
sidered the discoverer of K. Being the first means acting autonomously, without 
reliance on external authority, because there was none at the time when the
discovery was made, or the discovery contradicted the accepted beliefs.
Machine discoverers should be eventually held to the same standards. Abso- 
lute novelty is important, but a weaker criterion of novelty is useful in system 
construction: 

Agent A discovered knowledge K iff A acquired K without the use of
any knowledge source that knows K.

This definition calls for cognitive autonomy of agent A. It requires only that
K is novel to the agent; K does not have to be found for the first time in
human history. The emphasis on autonomy is useful in machine discovery. Even
though agent A discovered a piece of knowledge K which has been known to
others, we can still consider that A discovered K, if A did not know K before
making the discovery and was not guided towards K by any external authority. It
is relatively easy to trace the external guidance received by a machine discoverer,
as all details of software are available for inspection.

2 Principles of Autonomy

A1: Autonomy of an agent is increased by each new method that
overcomes some of the agent’s limitations.

Admittedly, each machine discoverer is only autonomous to a degree. Its autonomy,
however, can be increased by identifying the missing discovery capabilities
and developing methods that supply them (Langley et al., 1987; Nordhausen
& Langley, 1993; Kulkarni & Simon, 1987; Kocabas & Langley, 1995; Valdes-
Perez, 1993). The mere accumulation of new components, however, is not very 
effective. Each new component should be used in new creative ways, in combina- 
tion with the existing methods. As a result, more discovery steps in succession 
can be performed without external help, leading to greater autonomy: 

A2: Autonomy of an agent is increased by method integration, 
when new combinations of methods are introduced. 

Many methods use data to generate knowledge. When applied in sequence, 
elements of knowledge generated at the previous step become data for the next 
step. This perspective on knowledge as data for the next step towards improved 
knowledge is important for integration of many methods (Langley et al, 1987): 

A3: Each piece of discovered knowledge can be used as data for
another step towards discovery:

Data-1 --Step-1--> Knowledge-1 = Data-2 --Step-2--> Knowledge-2 = Data-3



3 Theory of Knowledge 

Knowledge of the external world goes beyond data, even if data are the primary
source of knowledge. It is important to understand elements of the formalism in
relation to elements of the external world. Consider a fairly broad representation
of a regularity (law, generalization):

Pattern (relationship) P holds in the range R of situations.

In practical applications this schema can be narrowed down, for instance:

(1) if $P_1(A_1) \,\&\, \ldots \,\&\, P_k(A_k)$ then $Rel(A, B)$

where $A, B, A_1, \ldots, A_k$ are attributes that describe each object in a class of objects, while
$P_1, \ldots, P_k$ are predicates, such as $A_1 > 0$ or $A_2 = a$. An even simpler schema:

(2) if $P_1(A_1) \,\&\, \ldots \,\&\, P_k(A_k)$ then $C = c$

covers all rules sought as concept definitions in machine learning.

A good fit between knowledge and data is important, but a discoverer should
know real-world objects and attributes, not merely data and formal hypotheses:

K1: Seek objective knowledge about the real world, not knowledge
about data.




Automated Discovery: A Fusion of Multidisciplinary Principles 445 



This principle contrasts with a common data mining practice, when re- 
searchers focus entirely on data. Sometimes, however, specific knowledge about 
data is important, for instance about wrong data or data encoding schemas. 

Schemas such as (1) or (2) define vast, sometimes infinite, hypothesis spaces, 
so that hypotheses must be generated, often piece by piece, evaluated and re- 
tained or eliminated. 

K2: [Principle of knowledge construction] All elements of each piece
of knowledge are constructed and evaluated by a discovery system.

Predictions are essential for hypothesis evaluation. It is doubtful that we
would consider a particular statement a piece of knowledge about the external world
if it did not enable empirically verifiable predictions:

K3: A common characteristic of knowledge is its empirical contents,
that is, empirically verifiable predictions.

Knowledge improvement can be measured by the increased empirical contents.
Logical inference is used to draw empirically verifiable conclusions. The
premises are typically general statements and some known facts, while conclusions
are statements which predict new facts. Empirical contents can occur in
regularities (laws, statements, sentences), not in predicates, which do not have
truth value. Concepts, understood as predicates, have no empirical contents. We
can define huge numbers of concepts, but that does not provide knowledge. The
vast majority of knowledge goes beyond concept definitions:

K4: Each concept is an investment; it can be justified by the regularities
it allows us to express.

4 Principles of Search 

Discoverers explore the unknown and examine many possibilities which can be 
seen as dead ends from the perspective of the eventually accepted solutions, 
because they do not become components of the accepted solutions. This process 
is called search. We can conclude that: 

S1: If you do not search, you do not discover.

A simple search problem in AI can be defined by a set of initial states and a 
set of goal states in a space of states and moves. The task is to find a trajectory 
from an initial state to a goal state. In the domain of discovery the goal states 
are not known in advance, but the basic framework of discovery can be applied 
(Simon, 1979; Langley et al., 1987):

S2: [Herbert Simon 1] Discovery is problem solving. Each problem
is defined by the initial state of knowledge, including data, and by
the goals. Solutions are generated by search mechanisms aimed at the
goals.




446 Jan M. Zytkow 



The initial state can be a set of data, while a goal state may be an equation 
that fits those data (Langley et al, 1987; Zembowicz & Zytkow, 1991; Dzeroski 
& Todorovski, 1993; Washio & Motoda, 1997). The search proceeds by construc- 
tion of terms, by their combinations into equations, by generation of numerical 
parameters in equations and by evaluation of completed equations. 

Search spaces should be sufficiently large, to provide solutions for many prob- 
lems. But simply enlarging the search space does not make an agent more cre- 
ative. It is easy to implement a program that enumerates all strings of characters. 
If enough time was available, it would produce all books, all data structures, all 
computer programs. But it produces a negligible proportion of valuable results 
and it cannot tell which are those valuable results. 

S3: [Herbert Simon 2] A heuristic and data-driven search is an efficient
and effective discovery tool. Data are transformed into plausible
pieces of solutions. Partial solutions are evaluated and used to guide
the search.

Goal states are supposed to exceed the evaluation thresholds. Without that, 
even the best hypothesis reached in the discovery process can be insufficient. A 
discovery search may fail or take too much time and a discoverer should be able 
to change the goal and continue. 

S4: [Recovery from failure] Each discovery step may fail, and cognitive
autonomy requires methods that recognize failure and decide
on the next goal.

Search states can be generated in many orders. Search control, which handles 
the search at run-time, is an important discovery tool. 

S5: [Simple-first] Order hypotheses by simplicity layers; try simpler
hypotheses before more complex ones.

The implementation is easy, since simpler hypotheses are constructed before 
more complex. Also, simpler hypotheses are usually more general, so they are 
tried before more complex, that is more specific hypotheses. If a simple hypoth- 
esis is sufficient, there is no need to make it more complex. 

Do not create the same hypothesis twice, but do not miss any: 

S6: Make search non-redundant and exhaustive within each simplicity
layer.
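The simple-first and non-redundancy principles can be illustrated with a small sketch. It assumes, purely for illustration, that a hypothesis is a conjunction of predicates and that its simplicity layer is the number of conjuncts; the function name `simple_first_search` and the `accepts` callback are hypothetical and do not describe any particular discovery system.

```python
from itertools import combinations

def simple_first_search(predicates, accepts, max_size):
    """Search conjunctions of predicates layer by layer, simplest first.

    predicates -- list of distinct predicate labels
    accepts    -- callable that evaluates a hypothesis (a tuple of predicates)
    max_size   -- largest number of conjuncts to try

    Within a layer, combinations() yields every subset of the given size
    exactly once, so the layer is searched exhaustively and without repeats.
    The search stops at the first layer that contains an accepted hypothesis.
    """
    for size in range(1, max_size + 1):
        layer = [h for h in combinations(predicates, size) if accepts(h)]
        if layer:
            return layer
    return []
```

For example, with predicates `["A>0", "B=1", "C<5"]` and an `accepts` function that requires both `"B=1"` and `"C<5"`, the first layer yields nothing and the second layer returns the single hypothesis `("B=1", "C<5")`; the three-conjunct layer is never generated.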

5 Beyond Simple-Minded Tools 

The vast majority of data mining is performed with the use of single-minded 
tools. Those tools miss discovery opportunities if results do not belong to a par- 
ticular hypothesis space. They rarely consider the question whether the best fit 
hypothesis is good enough to be accepted and whether other forms of knowledge 
are more suitable for a given case. They ignore the following principle (Zembowicz
& Zytkow, 1996):

O1: [Open-mindedness] Knowledge should be discovered in the form
that reflects real-world relationships, not one or another tool at hand.

6 Statistics 

Equations and other forms of deterministic knowledge can be augmented with
statistical distributions, for instance, $y = f(x) + N(0, \sigma(x))$. $N(0, \sigma(x))$ represents
a Gaussian distribution of error, with mean value equal to zero and standard
deviation $\sigma(x)$.

Most often a particular distribution is assumed rather than derived from 
data, because traditional statistical data mining operated on small samples and 
used visualization tools to stimulate human judgement. Currently, when large 
datasets are abundant and more data can be easily generated in automated 
experiments, we can argue for the verification of assumptions: 

STAT1: Do not make assumptions and do not leave unverified assumptions.

For instance, when using the model $y = f(x) + N(0, \sigma(x))$, verify the Gaussian
distribution of residua with the use of the runs test and other tests of normality.
Publications in statistics notoriously start from “Let us assume that ...” Either
use data to verify the assumptions or, when this is not possible, ask what is
the risk or cost when the assumptions are not met.

Another area which requires revision of traditional statistical thinking is testing
hypothesis significance. Statistics asks how many real regularities we are
willing to disregard (error of omission) and how many spurious regularities
we are willing to accept (error of admission). In a given dataset, weak regularities
cannot be distinguished from patterns that come from random distribution (the 
significance dilemma for a given regularity can be solved by acquisition of ad- 
ditional data). Automated discovery systems search massive hypothesis spaces 
with the use of statistical tests, which occasionally mistake a random fluctuation 
for a genuine regularity: 

STAT2: [Significance 1] Choose a significance threshold that strikes a
middle ground between spurious regularities and weak but real regularities
specific to a given hypothesis space.

While a significance threshold should admit a small percent of spurious reg- 
ularities, it is sometimes difficult to compute the right threshold for a given 
search. Each threshold depends on the number of independent hypotheses and 
independent tests. When those numbers are difficult to estimate, experiments on 
random data can be helpful. We know that those data contain no regularities, 
so all detected regularities are spurious and should be rejected by the test of 
significance. We should set the threshold just about that level: 

STAT3: [Significance 2] Use random data to determine the right
values of significance thresholds for a given search mechanism.
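One way to act on this principle is to run the search on random data many times and record the strongest regularity it reports each time. The sketch below assumes a hypothetical search mechanism that tests every pair of attributes for linear correlation and uses the absolute Pearson coefficient as its test statistic; the function names, the Gaussian random data, and the 95% quantile are illustrative choices, not part of the original text.

```python
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def calibrate_threshold(n_rows, n_attrs, n_trials=200, quantile=0.95, seed=0):
    """Estimate a threshold on |r| from random data with no real regularities.

    Each trial generates independent Gaussian columns, runs the pairwise
    search, and keeps the largest |r| found. The returned value is the chosen
    quantile of those maxima, so roughly (1 - quantile) of searches on pure
    noise would still report a spurious regularity above it.
    """
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_trials):
        cols = [[rng.gauss(0.0, 1.0) for _ in range(n_rows)]
                for _ in range(n_attrs)]
        best = max(abs(pearson_r(cols[i], cols[j]))
                   for i in range(n_attrs) for j in range(i + 1, n_attrs))
        maxima.append(best)
    maxima.sort()
    return maxima[min(len(maxima) - 1, int(quantile * len(maxima)))]
```

A call such as `calibrate_threshold(n_rows=30, n_attrs=5)` returns a value between 0 and 1 that depends on the sample size and on the number of hypotheses the search examines, which is the dependence the text describes.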
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